Description

FIELD OF THE INVENTION 

This invention provides sets of nucleic acid tags, arrays of oligonucleotide probes, nucleic acid-tagged sets of
recombinant cells and other compositions, and methods of selecting oligonucleotide probe arrays. The invention relates
to the selection and interaction of nucleic acids, and nucleic acids immobilized on solid substrates, including related
chemistry, biology, and medical diagnostic uses.

BACKGROUND OF THE INVENTION

Methods of forming large arrays of oligonucleotides and other polymers on a solid substrate are known. Pirrung
et al., U.S. Patent No. 5,143,854 (see also PCT Application No. WO 90/15070); McGall et al., U.S. Patent No. 5,412,087;
Chee et al., SN PCT/US94/12305; and Fodor et al., PCT Publication No. WO 92/10092 describe methods of forming
arrays of oligonucleotides and other polymers using, for example, light-directed synthesis techniques.

In the Fodor et al. publication, methods are described for using computer-controlled systems to direct polymer 
array synthesis. Using the Fodor approach, one heterogenous array of polymers is converted, through simultaneous 
coupling at multiple reaction sites, into a different heterogenous array. See also, Fodor et al. (1991) Science, 251:
767-777; Lipshutz et al. (1995) BioTechniques 19(3): 442-447; Fodor et al. (1993) Nature 364: 555-556; and Medlin
(1995) Environmental Health Perspectives 244-246. The arrays are typically placed on a solid surface with an area
less than 1 inch², although much larger surfaces are optionally used.

Additional methods applicable to polymer synthesis on a substrate are described, e.g., in US Pat. No. 5,364,261,
incorporated herein by reference for all purposes. In the methods disclosed in these applications, reagents are delivered
to the substrate by flowing or spotting polymer synthesis reagents on predefined regions of the solid substrate. In each
instance, certain activated regions of the substrate are physically separated from other regions when the monomer
solutions are delivered to the various reaction sites, e.g., by means of grooves, wells and the like.

Procedures for synthesizing polymer arrays are referred to herein as very large scale immobilized polymer synthesis
(VLSIPS™) procedures. Oligonucleotide VLSIPS™ arrays are useful, for instance, in a variety of procedures
for monitoring test nucleic acids in a sample. In probe arrays with multiple probe sets, many distinct hybridization
interactions can be monitored simultaneously. However, unwanted hybridization between probes, or between probes
and other nucleic acids, can make analysis of multiple hybridizations problematic. This invention solves these and
other problems.

SUMMARY OF THE INVENTION 

With this invention it is now possible to label and detect many individual components present, inter alia, in molecular,
cellular and viral libraries using a limited number of hybridization conditions. Components are labeled with specially
selected nucleic acid tags, and the presence of individual tags is monitored by hybridization to a probe array (typically
a VLSIPS™ array of oligonucleotide probes). Thus, the tag nucleic acids are labels for the individual components, and
the probe array provides a label reader which permits simultaneous detection of a very large number of tag nucleic
acids. This facilitates massive parallel analysis of all of the components in a mixture in a single assay.

For instance, as explained herein, all of the members of a cellular library can be tested for response to an environmental
stimulus using a mixture of all of the members of the cellular library in a single assay. This is accomplished,
e.g., by labeling each member of the cellular library (e.g., by cloning a nucleic acid tag into each cell type in the library),
mixing each cell type in the library in an appropriate solution, and exposing part of the solution to the selected environmental
stimulus. The distribution of nucleic acids in the library before and after the environmental stimulus is compared
by hybridization of the nucleic acids to a VLSIPS™ array, allowing for detection of cells which are specifically
affected by the environmental stimulus.
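
The before-and-after comparison described above can be sketched in a few lines of Python. This is only an illustrative sketch: the tag names, the intensity values, the pseudocount and the depletion threshold are hypothetical and are not taken from this specification. The sketch normalizes the hybridization signal for each tag to a relative abundance and reports tags whose relative signal falls sharply after the stimulus.

```python
import math

def normalize(signals):
    """Scale tag intensities so they sum to 1 (relative abundance)."""
    total = sum(signals.values())
    return {tag: value / total for tag, value in signals.items()}

def log2_ratios(before, after, pseudocount=1e-6):
    """Log2 of relative abundance after vs. before the stimulus, per tag."""
    b, a = normalize(before), normalize(after)
    return {
        tag: math.log2((a.get(tag, 0.0) + pseudocount) / (b[tag] + pseudocount))
        for tag in b
    }

def depleted_tags(before, after, threshold=-2.0):
    """Tags whose log2 ratio is at or below the (assumed) threshold."""
    ratios = log2_ratios(before, after)
    return sorted(tag for tag, ratio in ratios.items() if ratio <= threshold)

# Hypothetical array intensities for three tagged strains.
before = {"tag_A": 1000.0, "tag_B": 1000.0, "tag_C": 1000.0}
after = {"tag_A": 1500.0, "tag_B": 1450.0, "tag_C": 50.0}
print(depleted_tags(before, after))
```

In this made-up data set only the strain carrying tag_C loses signal relative to the pool, so it is the one flagged as affected by the stimulus.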

Accordingly, the present invention provides, inter alia, tag nucleic acids, sets of tag nucleic acids, methods of
selecting tag nucleic acids, libraries of cells, viruses or the like containing tag nucleic acids, arrays of oligonucleotide
probes, arrays of VLSIPS™ probes, methods of selecting arrays of oligonucleotide probes, methods of detecting tag
nucleic acids with VLSIPS™ arrays and other features which will become clear upon further reading.

In one class of embodiments, the invention provides a method of selecting a set of tag nucleic acids designed for
minimal cross hybridization to a VLSIPS™ array. The absence of cross hybridization facilitates analysis of hybridization
patterns to VLSIPS™ arrays, because it reduces ambiguities in the interpretation of hybridization results which arise
due to multiple nucleic acid species binding to a single species of probe on the VLSIPS™ array. Thus, in the selection
methods of the invention, potential tags are excluded from a set of tags where they bind to the same nucleic acid as
selected tags under stringent conditions. The selection methods typically include the steps of selecting a specific thermal
binding stability for the tag nucleic acids against complementary probes, and excluding tags which contain self-complementary
regions. Often, the thermal binding stability of the tags is selected by specifying parameters which influence
binding stability, such as the length and base composition (e.g., by selecting tags with the same AT to GC ratio of
nucleotides) for the tag nucleic acids. In this regard, tags which form more GC bonds upon binding a
complementary probe require fewer overall bases to have the same binding stability with a complementary probe as
tags which have fewer GC residues. Binding stability is also affected by base stacking interactions, the formation of
secondary structures and the choice of solvent in which a tag is bound to a probe.
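
As a minimal sketch of the two selection steps just described, the following Python fixes the length and G+C count of candidate tags and discards candidates with self-complementary regions. The 4-base window used to detect self-complementarity, the exact-match test and the example sequences are assumptions made for illustration; a practical implementation would likely use a thermodynamic model rather than simple string matching.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def gc_count(seq):
    """Number of G and C bases in the sequence."""
    return sum(base in "GC" for base in seq)

def has_self_complement(seq, k=4):
    """True if some k-mer of seq has its reverse complement within seq."""
    kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return any(reverse_complement(kmer) in kmers for kmer in kmers)

def select_candidates(candidates, length, gc, k=4):
    """Keep candidates with the chosen length and G+C count and no
    self-complementary k-mer (k is an assumed window size)."""
    return [
        seq for seq in candidates
        if len(seq) == length
        and gc_count(seq) == gc
        and not has_self_complement(seq, k)
    ]

# Hypothetical 8-mers: the second contains the palindrome ACGT,
# the third has too few G/C bases.
candidates = ["ACCTGACA", "ACGTCCAA", "AATTCAGA"]
print(select_candidates(candidates, length=8, gc=4))
```

Holding length and G+C count constant is a simple proxy for the uniform binding stability discussed above; it does not account for base stacking or solvent effects.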

The size of the tags can vary substantially, but is typically from about 8-150 nucleotides, more typically between
10 and 100 nucleotides, often between about 15 and 30 nucleotides, generally between about 15 and 25 nucleotides
and, in one preferred embodiment, about 20 nucleotides in length. In a few applications, the tags are substantially
longer than the probes to which they hybridize. The use of longer tags increases the number of tags from which
non-cross hybridizing probes can be selected.

The tag nucleic acids are optionally selected to have constant and variable regions, which facilitates elimination
of secondary structure arising from self-complementarity, and provides structural features for cloning and amplifying
the tags. For instance, PCR binding sites or restriction enzyme sites are optionally incorporated into constant regions
in the tags. In other embodiments, short constant regions are added in coding theory methods to prevent misalignment
of the tags. Constant regions are optionally cleaved from the tag during processing steps, for instance by cleaving the
tag nucleic acids with class II restriction enzymes.

Often it is desirable to eliminate tags which contain runs of 4 nucleotides selected from the group consisting of 4
X residues, 4 Y residues and 4 Z residues, where X is selected from the group consisting of G and C, Y is selected
from the group consisting of G and A, and Z is selected from the group consisting of A and T. The elimination of tags
from a tag set which contain such runs of nucleotides reduces the formation of secondary structure in the selected
tags in the tag set. In some embodiments, certain runs are permitted, while others are excluded. For instance, in one
embodiment, runs of 4 A/T or G/C nucleotides are prohibited.
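
The run-exclusion rule above can be expressed as a short regular-expression filter. The sketch below assumes one reading of the rule, namely that a prohibited run is four consecutive bases all drawn from the same class (X = G or C, Y = G or A, Z = A or T). The function name and the `classes` parameter are hypothetical; the parameter lets certain run classes be permitted while others are excluded.

```python
import re

# Base classes named in the text: X = {G, C}, Y = {G, A}, Z = {A, T}.
RUN_CLASSES = {"X": "GC", "Y": "GA", "Z": "AT"}

def has_prohibited_run(tag, classes=("X", "Y", "Z"), run_length=4):
    """True if tag contains run_length consecutive bases from one class."""
    pattern = "|".join(
        f"[{RUN_CLASSES[c]}]{{{run_length}}}" for c in classes
    )
    return re.search(pattern, tag) is not None

print(has_prohibited_run("ACGTACGT"))                       # no run
print(has_prohibited_run("ACGCGCTA"))                       # CGCG is an X run
print(has_prohibited_run("TCAGGATC"))                       # AGGA is a Y run
print(has_prohibited_run("TCAGGATC", classes=("X", "Z")))   # Y runs permitted
```

Passing `classes=("X", "Z")` corresponds to the embodiment in which only runs of 4 A/T or G/C nucleotides are prohibited.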

In many embodiments, tags which differ by fewer than about 80% of the total number of nucleotides which comprise
the tags are excluded. For instance, all selected tags in a selected tag set preferably differ by at least about 4-5 nucleotides.
It is also desirable to exclude tags which share substantial regions of sequence identity, because the regions
of identity can cross-hybridize to nucleic acids which have subsequences complementary to the region of identity. For
instance, where 20-mer tags are identical over regions of 9 or more nucleotides, they are typically excluded.
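
A minimal sketch of these two exclusion criteria follows. The greedy, order-dependent selection strategy is an assumption made for illustration, as are the function names. The defaults (`min_diff=5`, `k=9`) echo the 20-mer examples in the text, while the usage example uses shorter hypothetical sequences with correspondingly smaller parameters.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length tags."""
    return sum(x != y for x, y in zip(a, b))

def share_kmer(a, b, k=9):
    """True if a and b contain an identical region of k or more bases,
    at any offset."""
    kmers = {a[i:i + k] for i in range(len(a) - k + 1)}
    return any(b[i:i + k] in kmers for i in range(len(b) - k + 1))

def greedy_select(candidates, min_diff=5, k=9):
    """Accept each candidate only if it differs from every accepted tag
    at min_diff or more positions and shares no k-base region with it."""
    selected = []
    for cand in candidates:
        if all(
            hamming(cand, s) >= min_diff and not share_kmer(cand, s, k)
            for s in selected
        ):
            selected.append(cand)
    return selected

# Hypothetical 8-mers, scaled-down thresholds.
candidates = ["ACGTACGT", "ACGTACGA", "TGCATGCA", "GTACGTCC"]
print(greedy_select(candidates, min_diff=3, k=5))
```

The second candidate is dropped because it differs from the first at a single position. The fourth mismatches the first at every aligned position but still shares the shifted 5-base region GTACG with it, so it is dropped as well.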

The tags in the tag sets of the invention typically differ by at least two nucleotides, and preferably by 3-5 nucleotides
for a typical 20-mer. A list of tags which differ by at least two nucleotides can be generated by pairwise comparison of
each tag, or by other methods. For instance, the tag sequences can be aligned for maximal correspondence and tags
with a single mismatch discarded. In one class of embodiments, the number of A+G nucleotides in each of the variable
regions of each of the tags is selected to be even (or, alternatively, odd), providing a "parity base" or "error correcting
base" which provides that each tag have at least two hybridization mismatches between every tag in the tag set, and
any individual complementary nucleic acid probe (other than the probe which is a perfect complement to the tag). Other
methods of ensuring that at least two mismatches exist between every tag in a tag set and any individual hybridization
probe are also appropriate.
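
The A+G parity idea can be sketched as follows. The choice of C and A as the appended parity bases is arbitrary and hypothetical, as are the function names. The sketch shows that a single substitution exchanging a purine (A, G) for a pyrimidine (C, T) changes the parity, so the altered sequence is no longer a member of the even-parity set.

```python
PURINES = frozenset("AG")

def purine_count(seq):
    """Number of A and G bases in the sequence."""
    return sum(base in PURINES for base in seq)

def has_parity(seq, even=True):
    """True if the A+G count of seq has the requested parity."""
    return purine_count(seq) % 2 == (0 if even else 1)

def add_parity_base(body, even=True):
    """Append one base so the A+G count of the result has the requested
    parity. Appending C leaves the count unchanged; appending A adds one.
    (The choice of C and A is an arbitrary assumption.)"""
    return body + ("C" if has_parity(body, even) else "A")

tag1 = add_parity_base("ACGT")   # A+G count already even
tag2 = add_parity_base("ACTT")   # A+G count odd, so a purine is appended
mutant = "C" + tag1[1:]          # A -> C swaps a purine for a pyrimidine
print(tag1, tag2, has_parity(tag1), has_parity(tag2), has_parity(mutant))
```

Substitutions within the purines or within the pyrimidines leave this particular count unchanged, so a pairwise comparison of the kind described above remains a useful additional check.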

In general, the selection of the tag nucleic acids facilitates selection of the probe nucleic acids, e.g., on VLSIPS™
arrays used to monitor the tag nucleic acids by hybridization. Specifically, the probes on the array are selected for their
ability to hybridize to variable sequences in the set of tag nucleic acids (the "variable" region of a tag which does not
include a constant region is the entire tag). Thus, all of the rules for selection of tag nucleic acids can be applied to the
selection of probe nucleic acids, for example by performing the tag selection steps and then determining the complementary
set of probe nucleic acids.
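
Deriving the complementary probe set from a selected tag set can be sketched as below. The flanking constant sequences and the tag itself are hypothetical placeholders, not sequences from this specification. The sketch strips any constant regions and takes the reverse complement of the remaining variable region as the probe.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def variable_region(tag, left="", right=""):
    """Remove the (assumed) constant regions flanking the variable region.
    With no constant regions, the variable region is the entire tag."""
    if not (tag.startswith(left) and tag.endswith(right)):
        raise ValueError("tag does not carry the expected constant regions")
    return tag[len(left):len(tag) - len(right)]

def probe_for(tag, left="", right=""):
    """Probe complementary to the variable region, written 5' to 3'."""
    return variable_region(tag, left, right).translate(COMPLEMENT)[::-1]

# Hypothetical tag: constant region + 8-base variable region + constant region.
LEFT, RIGHT = "GGATCC", "GAATTC"
tag = LEFT + "AACGTTCA" + RIGHT
print(probe_for(tag, LEFT, RIGHT))
print(probe_for("AACGTTCA"))
```

Both calls yield the same probe, which reflects the statement above that a tag without a constant region is treated as variable over its full length.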

In another class of embodiments, the invention provides compositions comprising sets of tag nucleic acids, which
include a plurality of tag nucleic acids. In preferred embodiments, the set of tag nucleic acids comprises from
100-100,000 tags. Typically, a tag set will include between about 500 and 15,000 tags. Usually, the number of tags in
a tag set is between about 5,000 and about 14,000 tags. In one preferred embodiment, a set of tags of the invention
comprises about 8,000-9,000 tags. The tag sequences typically comprise a variable region, where the variable region
for each tag nucleic acid in the set of tag nucleic acids has the same G+C to A+T ratio, approximately the
same Tm and the same length, and does not cross-hybridize to a single complementary probe nucleic acid. Most typically,
the tag nucleic acids in the set of tag nucleic acids cannot be aligned with fewer than two differences between any two
of the tag nucleic acids in the set of tag nucleic acids, and often at least 5 differences exist between any pair of tags
in a tag set. In one embodiment, the tags also comprise a constant region such as a PCR primer binding site for
amplification of the tag.

In one class of embodiments, the invention provides a method of labeling a composition, comprising associating
a tag nucleic acid with the composition, wherein the tag nucleic acid is selected from a group of tag nucleic acids which
do not cross-hybridize and which have a substantially similar Tm. Typically, the tag labels are detected with a VLSIPS™
array which comprises probes complementary to the tags used to label the composition.

As described herein, preferred compositions include constituents of cellular, viral or molecular libraries such as
recombinant cells, recombinant viruses or polymers. However, one of skill will readily appreciate that other compositions
can also be labeled using the nucleic acid tags of the invention, and the tags detected using VLSIPS™ arrays. For
instance, high denomination currency can be labeled with a set of nucleic acid tags, and counterfeits detected by
monitoring hybridization of a wash of the currency (or, e.g., a PCR amplification of attached nucleic acids which encode
tag sequences) with an appropriate VLSIPS™ array.

In another class of embodiments, the invention provides methods of pre-selecting experimental probes in an oligonucleotide
probe array, wherein the probes have substantially uniform hybridization properties and do not cross
hybridize to a target tag nucleic acid. In the methods, a ratio of G+C to A+T nucleotides shared by the experimental
probes in the array is selected and all possible 4 nucleotide subsequences for the probes of the array are determined.
All potential probes from the array which contain prohibited 4 nucleotide subsequences are excluded from the experimental
probes of the array. 4 nucleotide subsequences are prohibited when the nucleotide subsequences are selected
from the group consisting of self-complementary probes, A4 probes, T4 probes, and [G,C]4 probes. Also, where the
target tag nucleic acid comprises a constant region, all probes complementary to the constant region subsequence
of the target tag nucleic acid are prohibited, and not present in the tag set. Typically, a length for the probes in the array
is selected, although non-hybridizing portions of the probe (i.e., nucleotides which do not hybridize to a target nucleic
acid) optionally vary between different classes of probes. "Experimental" probes hybridize to a target tag nucleic acid,
while "control" probes either do not hybridize to a target tag nucleic acid, or bind to a nucleic acid which has hybridization
properties which differ from those of the target tag nucleic acids in a tag nucleic acid set. For instance, control probes
are optionally used in VLSIPS™ arrays to check hybridization stringency against a known nucleic acid.
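
The pre-selection steps above can be sketched by enumerating all 256 possible 4 nucleotide subsequences and marking the prohibited ones. The sketch is a simplified reading of the method: function names and example probes are hypothetical, the G+C requirement is modeled as an exact G+C fraction, and the constant-region exclusion is not modeled.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def prohibited_4mers():
    """Self-complementary 4-mers, A4, T4, and 4-mers made only of G and C."""
    banned = set()
    for bases in product("ACGT", repeat=4):
        kmer = "".join(bases)
        if (
            kmer == reverse_complement(kmer)
            or kmer in ("AAAA", "TTTT")
            or set(kmer) <= {"G", "C"}
        ):
            banned.add(kmer)
    return banned

def acceptable_probe(probe, gc_fraction, banned):
    """True if the probe has the chosen G+C fraction and contains no
    prohibited 4 nucleotide subsequence."""
    gc = sum(base in "GC" for base in probe) / len(probe)
    if abs(gc - gc_fraction) > 1e-9:
        return False
    return not any(probe[i:i + 4] in banned for i in range(len(probe) - 3))

banned = prohibited_4mers()
print(len(banned))
for probe in ("ACCTGACA", "ACGTCCAA", "ACCTGACC"):
    print(probe, acceptable_probe(probe, 0.5, banned))
```

Of the three hypothetical 8-mers, only the first passes: the second contains the self-complementary subsequence ACGT, and the third has a G+C fraction other than 0.5.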

In one class of methods of the invention, a plurality of test nucleic acids are simultaneously detected in a sample.
In the methods, an array of experimental probes which do not cross hybridize to a target under stringent conditions is
used to detect the target nucleic acids. Typically, the ratio of G+C bases in each experimental probe is substantially
identical. The probes of the array are arranged into probe sets in which each probe set comprises a homogeneous
population of oligonucleotide probes. For example, many individual probes with the same nucleotide sequence are
arranged in proximity to one another on the surface of an array to form a particular geometric shape. Probe sets are
arranged in proximity to each other to form an array of probes. For instance, where the probe array is a VLSIPS™
array, the probe sets are optionally arranged into squares on the surface of a substrate, forming a checkerboard pattern
of probe sets on the substrate.

The probes of the array specifically hybridize to at least one test nucleic acid in the sample under stringent hybridization
conditions. The method further comprises detecting hybridization of the test nucleic acids to the array of oligonucleotide
probes. Typically, the test nucleic acids comprise tag sequences, which bind to the experimental probes of
the array.

In one class of embodiments, the invention provides an array of oligonucleotide probes comprising a plurality of
experimental oligonucleotide probe sets attached to a solid substrate, wherein each experimental oligonucleotide probe
set in the array hybridizes to a different target nucleic acid under stringent hybridization conditions. Each experimental
oligonucleotide probe in the probe sets of the array comprises a constant region and a variable region. The variable
region does not cross hybridize with the constant region under stringent hybridization conditions, and the nucleic acid
probes do not cross-hybridize to target nucleic acids. Typically the probes from each probe set differ from the probes
of every other probe set in the array by the arrangement of at least two nucleotides in the probes of the probe set.
Generally, the ratio of G+C bases in each probe for each experimental probe set is substantially identical (meaning
that the G+C ratio does not vary by more than 5%), thereby assuring that they hybridize to a target with similar avidity
under similar hybridization conditions. The arrays optionally comprise control probes, e.g., to assess hybridization
conditions by monitoring binding of a known quantitated nucleic acid to the control probe.

In another class of embodiments, the invention provides a plurality of recombinant cells or recombinant viruses
comprising tag nucleic acids, which tag nucleic acids comprise a constant region and a variable region. Typically, the
variable region for each tag nucleic acid in the set of tag nucleic acids has approximately the same Tm (e.g., the same
G+C to A+T ratio and the same length) and does not cross-hybridize to a probe nucleic acid. Different tag nucleic acids
in the set of tag nucleic acids found in the different recombinant cells cannot be aligned with fewer than two differences
between the tag nucleic acids. Generally, the recombinant cells are selected from a library of genetically distinct recombinant
cells (eukaryotic, prokaryotic or archaebacterial) or viruses. For example, in one class of preferred embodiments,
the cells are yeast cells. In another class of preferred embodiments, the cells are of mammalian origin.
The present invention provides arrays of oligonucleotides attached to solid substrates. Typically, the oligonucleotide
probes in the array are arranged into probe sets at defined locations in the array to enhance signal processing of
hybridization reactions between the oligonucleotide probes and test nucleic acids in a sample. The oligonucleotide
arrays can have virtually any number of different oligonucleotide sets, determined largely by the number or variety of
test nucleic acids or nucleic acid tags to be screened against the array in a given application. In one group of embodiments,
the array has from 10 up to 100 oligonucleotide sets. In other groups of embodiments, the arrays have between
100 and 10,000 sets. In certain embodiments, the arrays have between 10,000 and 100,000 sets, and in yet other
embodiments the arrays have between 100,000 and 1,000,000 sets. Most preferred embodiments will have between
7,500 and 12,500 sets. For example, in one preferred embodiment, the arrays will comprise about 3,000 sets of oligonucleotide
probes. In preferred embodiments, the array will have a density of more than 100 sets of oligonucleotides
at known locations per cm², or more preferably, more than 1000 sets per cm². In some embodiments, the arrays have
a density of more than 10,000 sets per cm².

The present invention also provides kits embodying the inventive concepts outlined above. For example, kits of
the invention comprise any of the arrays, cells, libraries or tag sets described herein. Also, because the methods of
using the arrays and tags optionally include PCR, LCR and other in vitro amplification techniques for amplifying tag
nucleic acids, the kits of the invention optionally include reagents for practicing in vitro amplification methods such as
Taq polymerase, nucleotides, computer software with tag selection programs and the like. The kits also optionally
comprise nucleic acid labeling reagents, instructions, containers and other items that will be apparent to one of skill
upon further reading.

BRIEF DESCRIPTION OF THE DRAWING 

Figure 1 is a scanned image of a 1.28 cm by 1.28 cm high-density array hybridized with a fluorescently labeled
control oligonucleotide. The array contains complementary sequences to 4,500 20mer tags selected as described in
Table 1. Control oligonucleotides are synthesized in the corners and in a cross-hair pattern across the array to verify
the uniformity of the synthesis and the hybridization conditions. "DNA TAGS" was spelled out with control oligonucleotides
as well. The dark areas indicate the location of the 4,500 20 base molecular tags. Note that there is no cross-hybridization
of the control oligonucleotide and the molecular tag sequences.

Figure 2 shows a PCR-targeting strategy used to generate tagged deletion strains. (a) The ORF is identified from
the sequence information in the database. Regions immediately flanking the ORF are used to generate the deletion
strain. (b) The selectable marker (kan^r) is amplified using a pair of long primers to generate an ORF specific deletion
construct. The up-stream 90mer primer consists of (5' to 3'): 30 bases of yeast homology, an 18 base common tag
priming site, a 20 base molecular tag, and a 22 base sequence that is homologous to one side of the marker. The
down-stream oligonucleotide consists of 50 bases of yeast homology to the other side of the targeted ORF and 16
bases that are homologous to the other side of the marker. The dashed lines representing the long oligonucleotides
illustrate that the primers are unpurified and are missing sequence on the 5' end. (c) A second round of PCR with
20mers homologous to the ends of the initial PCR product was used to "flush" the ragged ends generated by unpurified
oligonucleotide in the first round. (d) The resulting marker flanked by yeast ORF homology on either side is transformed
directly into a haploid yeast strain and homologous recombination results in the replacement of the targeted ORF with
the marker, 20mer tag, and tag priming site.
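
The layout of the up-stream primer in Figure 2(b) can be checked with a small sketch that concatenates the four segments in 5' to 3' order and verifies each length. The segment names and the filler sequences are hypothetical placeholders; only the segment lengths come from the figure description.

```python
# Segment lengths (in bases, 5' to 3') as listed for the up-stream primer.
UPSTREAM_LAYOUT = (
    ("yeast_homology", 30),
    ("tag_priming_site", 18),
    ("molecular_tag", 20),
    ("marker_homology", 22),
)

def build_upstream_primer(segments):
    """Concatenate the named segments 5' to 3', checking each length."""
    parts = []
    for name, expected in UPSTREAM_LAYOUT:
        seq = segments[name]
        if len(seq) != expected:
            raise ValueError(f"{name}: expected {expected} bases, got {len(seq)}")
        parts.append(seq)
    return "".join(parts)

def placeholder(length, unit="ACGT"):
    """Hypothetical filler sequence of the requested length."""
    return (unit * (length // len(unit) + 1))[:length]

segments = {name: placeholder(n) for name, n in UPSTREAM_LAYOUT}
primer = build_upstream_primer(segments)
print(len(primer))
```

The four segment lengths sum to 90 bases, which matches the 90mer primer named in the figure description.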

Figure 3 shows oligonucleotides used to generate the ADE1 tagged deletion strain. Similar sets of oligonucleotides
were synthesized for the other ten auxotrophic ORFs.

Figure 4 shows transformation results and tag information for eleven auxotrophic ORFs. Eight colonies from each
transformation were analyzed by replica plating and PCR; the resulting targeting efficiency is shown for each of the
ORFs. The sequence and x,y coordinates are shown for the molecular tags that were used to uniquely label the different
deletion strains.

Figure 5 shows the tag amplification strategy described in Example 1. (a) A deletion pool was generated by combining
equal numbers of the eleven tagged deletion strains described in Figure 3. Genomic DNA isolated from a representative
aliquot of the pool was used as template for a tag amplification reaction. (b) Tags were amplified using a
single pair of primers that are homologous to the common priming sites which flank each tag. One of the common
primers is labeled with 5' fluorescein and included in a 10-fold excess over the unlabeled primer. (c) The asymmetric
nature of the PCR generates a population of single-stranded fluorescently labeled 60mer tag amplicons that are directly
hybridized to the high-density 20mer array, which is then washed and scanned. (d) An actual scanned image of the array
shows the (predicted) hybridization pattern for the tags with virtually no cross-hybridization on the rest of the chip. A
closeup view of the left hand corner shows the location of the tags for each of the different deletion strains.

Figure 6 shows the analysis of a deletion pool containing 11 tagged auxotrophic deletion strains. A deletion pool
was generated by combining equal numbers of cells from each of the 11 deletion strains described in Figure 3. Representative
aliquots were grown in (A) complete media (SDC), (B) media missing adenine (SDC-ADE), or (C) media
missing tryptophan (SDC-TRP). Cells were harvested at the indicated time points and genomic DNA was isolated.
Tags were amplified from the genomic DNA and labeled amplicons were directly hybridized to the high-density array
for 30 minutes, washed, and scanned. A blowup of the upper left hand corner for each of the scans is shown.

DEFINITIONS

Unless defined otherwise, technical and scientific terms used herein have the same meaning as commonly understood
by one of ordinary skill in the art to which this invention belongs. Singleton et al. (1994) Dictionary of Microbiology
and Molecular Biology, second edition, John Wiley and Sons (New York) and March (March, Advanced Organic
Chemistry: Reactions, Mechanisms and Structure, 4th ed., J. Wiley and Sons (New York, 1992)) provide one of skill with
a general guide to many of the terms used in this invention.

Although one of skill will recognize many methods and materials similar or equivalent to those described herein
which can be used in the practice of the present invention, the preferred methods and materials are described. For
purposes of the present invention, the following terms are defined below.

"Eukaryotic" cells are cells which contain at least one nucleus in which the cell's genomic DNA is organized, or
which are the differentiated offspring of cells which contained at least one nucleus. Eukaryotes are distinguished from
prokaryotes, which are cellular organisms which carry their genomic DNA in the cell's cytoplasm.

A "nucleoside" is a pentose glycoside in which the aglycone is a heterocyclic base; upon the addition of a phosphate
group the compound becomes a nucleotide. The major biological nucleosides are β-glycoside derivatives of D-ribose
or D-2-deoxyribose. Nucleotides are phosphate esters of nucleosides which are acidic due to the hydroxy groups on
the phosphate. The polymerized nucleotides deoxyribonucleic acid (DNA) and ribonucleic acid (RNA) store the genetic
information which controls all aspects of an organism's interaction with its environment. The nucleosides of DNA and
RNA are connected together via phosphate units attached to the 3' position of one pentose and the 5' position of the
next pentose.

A "nucleic acid" is a deoxyribonucleotide or ribonucleotide polymer in either single- or double-stranded form, and
unless otherwise limited, encompasses known analogs of natural nucleotides that function in a manner similar to naturally
occurring nucleotides.

An "oligonucleotide" is a nucleic acid polymer composed of two or more nucleotides or nucleotide analogues. An
oligonucleotide can be derived from natural sources but is often synthesized chemically. It is of any size.

An "oligonucleotide array" is a spatially defined pattern of oligonucleotide probes on a solid support. A "preselected
array of oligonucleotides" is an array of spatially defined oligonucleotides on a solid support which is designed before
being constructed (i.e., the arrangement of polymers on solid substrate during synthesis is deliberate, and not random).

A "nucleic acid reagent" utilized in standard automated oligonucleotide synthesis typically carries a protected phosphate
on the 3' hydroxyl of the ribose. Thus, nucleic acid reagents are referred to as nucleotides, nucleotide reagents,
nucleoside reagents, nucleoside phosphates, nucleoside-3'-phosphates, nucleoside phosphoramidites, phosphoramidites,
nucleoside phosphonates, phosphonates and the like. It is generally understood that nucleotide reagents carry
a protected phosphate group in order to form a phosphodiester linkage.

A "protecting group" as used herein, refers to any of the groups which are designed to block one reactive site in
a molecule while a chemical reaction is carried out at another reactive site. More particularly, the protecting groups
used herein can be any of those groups described in Greene, et al., Protective Groups in Organic Chemistry, 2nd Ed.,
John Wiley & Sons, New York, NY, 1991, which is incorporated herein by reference. The proper selection of protecting
groups for a particular synthesis is governed by the overall methods employed in the synthesis. For example, in
"light-directed" synthesis, discussed herein, the protecting groups are typically photolabile protecting groups such as NVOC,
MeNPoc, and those disclosed in co-pending Application PCT/US93/10162 (filed October 22, 1993), incorporated herein
by reference. In other methods, protecting groups are removed by chemical methods and include groups such as
FMOC, DMT and others known to those of skill in the art.

A "solid substrate" has a fixed organizational support matrix, such as silica, polymeric materials, or glass. In some
embodiments, at least one surface of the substrate is partially planar. In other embodiments it is desirable to physically
separate regions of the substrate to delineate synthetic regions, for example with trenches, grooves, wells or the like.
Examples of solid substrates include slides, beads and polymeric chips. A solid support is "functionalized" to permit the
coupling of monomers used in polymer synthesis. For example, a solid support is optionally coupled to a nucleoside
monomer through a covalent linkage to the 3'-carbon on a furanose. Solid support materials typically are unreactive
during polymer synthesis, providing a substratum to anchor the growing polymer. Solid support materials include, but
are not limited to, glass, silica, controlled pore glass (CPG), polystyrene, polystyrene/latex, and carboxyl modified
Teflon. The solid substrates are biological, non-biological, organic, inorganic, or a combination of any of these, existing
as particles, strands, precipitates, gels, sheets, tubing, spheres, containers, capillaries, pads, slices, films, plates,
slides, etc. depending upon the particular application. In light-directed synthetic techniques, the solid substrate is often
planar but optionally takes on alternative surface configurations. For example, the solid substrate optionally contains
raised or depressed regions on which synthesis takes place. In some embodiments, the solid substrate is chosen to
provide appropriate light-absorbing characteristics. For example, the substrate may be a polymerized Langmuir Blodgett
film, functionalized glass, Si, Ge, GaAs, GaP, SiO2, SiN4, modified silicon, or any one of a variety of gels or polymers
such as (poly)tetrafluoroethylene, (poly)vinylidenedifluoride, polystyrene, polycarbonate, or combinations thereof. Other
suitable solid substrate materials will be readily apparent to those of skill in the art. Preferably the surface of the solid
substrate will contain reactive groups, such as carboxyl, amino, hydroxyl, thiol, or the like. More preferably the surface
is optically transparent and has surface Si-OH functionalities, such as are found on silica surfaces. A substrate is a
material having a rigid or semi-rigid surface. In spotting or flowing VLSIPS™ techniques, at least one surface of the
solid substrate is optionally planar, although in many embodiments it is desirable to physically separate synthesis
regions for different polymers with, for example, wells, raised regions, etched trenches, or the like. In some embodiments,
the substrate itself contains wells, trenches, flow through regions, etc. which form all or part of the regions upon
which polymer synthesis occurs.

The term "recombinant" when used with reference to a cell or virus indicates that the cell or virus encodes a DNA
or RNA whose origin is exogenous to the cell or virus. Thus, for example, recombinant cells optionally express nucleic
acids (e.g., RNA) not found within the native (non-recombinant) form of the cell.

"Stringent" hybridization conditions are sequence dependent and will be different with different environmental parameters
(salt concentrations, presence of organics, etc.). Generally, stringent conditions are selected to be about 5°C
to 20°C lower than the thermal melting point (Tm) for the specific nucleic acid sequence at a defined ionic strength
and pH. Preferably, stringent conditions are about 5°C to 10°C lower than the thermal melting point for a specific
nucleic acid bound to a complementary nucleic acid. The Tm is the temperature (under defined ionic strength and pH)
at which 50% of a nucleic acid (e.g., tag nucleic acid) hybridizes to a perfectly matched probe. "Thermal binding stability"
is a measure of the temperature-dependent stability of a nucleic acid duplex in solution. The thermal binding stability
for a duplex is dependent on the solvent, base composition of the duplex, number and type of base pairs, position of
base pairs in the duplex, length of the duplex and the like.

"Stringent" wash conditions are ordinarily determined empirically for hybridization of each set of tags to a corresponding
probe array. The arrays are first hybridized (typically under stringent hybridization conditions) and then
washed with buffers containing successively lower concentrations of salts, and/or higher concentrations of detergents,
and/or at increasing temperatures until the signal to noise ratio for specific to non-specific hybridization is high enough
to facilitate detection of specific hybridization.

Stringent temperature conditions will usually include temperatures in excess of about 30°C, more usually in excess
of about 37°C, and occasionally in excess of about 45°C. Stringent salt conditions will ordinarily be less than about
1000 mM, usually less than about 500 mM, more usually less than about 400 mM, typically less than about 300 mM,
preferably less than about 200 mM, and more preferably less than about 150 mM. However, the combination of
parameters is more important than the measure of any single parameter. See, e.g., Wetmur and Davidson (1968) J. Mol.
Biol. 31:349-370 and Wetmur (1991) Critical Reviews in Biochemistry and Molecular Biology 26(3/4), 227-259.

The term "identical" in the context of two nucleic acid sequences refers to the residues in the two sequences which
are the same when aligned for maximum correspondence. Optimal alignment of sequences for comparison can be
conducted, e.g., by the local homology algorithm of Smith and Waterman Adv. Appl. Math. 2: 482 (1981), by the
homology alignment algorithm of Needleman and Wunsch J. Mol. Biol. 48:443 (1970), by the search for similarity method
of Pearson and Lipman Proc. Natl. Acad. Sci. (U.S.A.) 85: 2444 (1988), by computerized implementations of these
algorithms (GAP, BESTFIT, FASTA, and TFASTA in the Wisconsin Genetics Software Package, Genetics Computer
Group, 575 Science Dr., Madison, WI), or by inspection.

A nucleic acid "tag" is a selected nucleic acid with a specified nucleic acid sequence. A nucleic acid "probe"
hybridizes to a nucleic acid "tag." In one typical configuration, nucleic acid tags are incorporated as labels into
biological libraries, and the tag nucleic acids are detected using a VLSIPS™ array of probes. Thus, the "tag" nucleic
acid functions in a manner analogous to a bar code label, and the VLSIPS™ array of probes functions in a manner
analogous to a bar code label reader. A "list of tag nucleic acids" is a pool of tag nucleic acids, or a representation
(i.e., an electronic or paper copy) of the sequences in the pool of tag nucleic acids. The pool of tags can be, for
instance, all possible tags of a specified length (e.g., all 20-mers), or a subset thereof.

A set of nucleic acid tags binds to a probe with "minimal cross hybridization" when a single species (or "type") of
tag in the tag set accounts for the majority of all tags which bind to an array comprising a probe species under stringent
conditions. Typically, about 80% or more of the tags bound to the probe species are of a single species under stringent
conditions. Usually about 90% or more of the tags bound to the probe species are of a single species under stringent
conditions. Preferably 95% or more of the tags bound to the probe species are of a single species under stringent
conditions.
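
As an illustration only, the definition above can be expressed as a simple fraction test. The following Python sketch
assumes that the tags bound to one probe species have already been identified and listed; the function names and the
input format are hypothetical and are not part of the specification.

```python
from collections import Counter

def dominant_fraction(bound_tags):
    """Return (most common tag species, its fraction of all tags bound to one probe)."""
    if not bound_tags:
        raise ValueError("no bound tags supplied")
    species, count = Counter(bound_tags).most_common(1)[0]
    return species, count / len(bound_tags)

def has_minimal_cross_hybridization(bound_tags, threshold=0.80):
    """True if a single tag species accounts for at least `threshold` of the bound tags."""
    _, fraction = dominant_fraction(bound_tags)
    return fraction >= threshold
```

The default threshold of 0.80 corresponds to the "typically" level stated above; 0.90 or 0.95 can be passed for the
stricter levels.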

DETAILED DESCRIPTION OF THE INVENTION 

The invention provides methods of selecting and detecting sets of tag nucleic acids. In addition, the invention
provides arrays of probe nucleic acids for detecting tag nucleic acids, sets of tag nucleic acids and cells which comprise
tag nucleic acids. The tag nucleic acids, probe nucleic acid arrays and cells transformed with tag nucleic acids have
a variety of uses. Most commonly, the tag nucleic acids of the invention are used to tag cells with known genotypic
markers (mutants, polymorphisms, etc.), and to track the effect of environmental changes on the viability of tagged cells.

For instance, with the completion of the sequencing of S. cerevisiae, thousands of open reading frames (ORFs)
have been identified. One strategy for identifying the function of the identified ORFs is to create deletion mutants for
each ORF, followed by analysis of the resulting deletion mutants under a wide variety of selective conditions. Typically,
the goal of such an analysis is to identify a phenotype that reveals the function of the missing ORF. If the analysis were
to be carried out for each deletion mutant in a separate experiment, the required time and cost for monitoring the effect
of altering an environmental parameter on each deletion mutant would be prohibitive. For instance, to identify ORFs
which are required for synthesis of an amino acid, all of the thousands of ORF deletion mutants would be individually
tested for the ability of the mutant to grow in media lacking the amino acid. Even if the analysis were carried out in a
parallel fashion using, e.g., 96-well plates, the effort required to plate, organize, label and track each clone would be
prohibitive. The present invention provides a much more cost-effective approach to screening cells.

In the methods of the invention all of the thousands of deletion mutants described above can be tested in parallel
in a single experiment. The deletion mutants are each tagged with a tag nucleic acid, and the deletion mutants are
then pooled. The pooled, tagged deletion mutants are then simultaneously tested for their response to an environmental
stimulus (e.g., growth in medium lacking an amino acid). The deletion cell-specific tags are then read using a probe
array such as a VLSIPS™ array. Thus, by analogy, the deletion cell-specific nucleic acid tags act as bar code labels
for the cells and the VLSIPS™ array acts as a bar code reader.
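
The pooled readout described above can be illustrated with a short Python sketch. It assumes, purely for illustration,
that array signals for each tag are available before and after selection as dictionaries; the function name, the data
format, and the ten-fold depletion cutoff are hypothetical choices and are not taken from the specification.

```python
def depleted_tags(before, after, max_ratio=0.1):
    """Return tags whose share of the total signal fell to max_ratio (or less) of its initial share.

    before, after: dicts mapping tag name -> array signal for the pooled culture.
    """
    total_before = sum(before.values())
    total_after = sum(after.values())
    if total_before <= 0 or total_after <= 0:
        raise ValueError("total signal must be positive")
    depleted = []
    for tag, signal in before.items():
        share_before = signal / total_before
        share_after = after.get(tag, 0) / total_after  # a missing tag counts as zero signal
        if share_after <= max_ratio * share_before:
            depleted.append(tag)
    return sorted(depleted)
```

A tag reported by such a comparison would point to a deletion mutant that failed to grow under the selective condition.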

While the example above specifically discusses labeling yeast cells, one of skill will appreciate that essentially any
cell type can be labeled with the nucleic acid tags of the invention, including prokaryotes, eukaryotes, and
archaebacteria. Also, essentially any virus can be similarly labeled, as can cellular organelles with nucleic acids
(mitochondria, chloroplasts, etc.). In fact, labeling by tag nucleic acids and detection by probe arrays is not in any way
limited to biological materials. One of skill will recognize that many other compositions can also be labeled by nucleic
acid tags and detected with probe arrays. Essentially anything which benefits from the attachment of a label can be
labeled and detected by the tags, arrays, and methods of the invention. For instance, high denominational currency,
original works of art, valuable stamps, significant legal documents such as wills, deeds of property, and contracts can
all be labeled with nucleic acid tags and the tags read using the probe arrays of the invention. Methods of attaching and
cleaving nucleic acids to and from many substrates are well known in the art, including glass, polymers, paper, ceramics
and the like, and these techniques are applicable to the nucleic acid tags of the invention.

One of skill will also appreciate that while many of the examples herein describe the use of a single tag nucleic
acid to label a cell, multiple tags can also be used to label any cell, e.g., by cloning multiple nucleic acid tags into the
cell. Similarly, multiple nucleic acid tags can be used to label a substance such as those described above. Indeed,
multiple labels are typically preferred where the object of the nucleic acid tags is to detect forgery. For instance, the
nucleic acids of the invention can be used to label a high denomination currency bill with hundreds, or even thousands,
of distinct tags, such that visualization of the hybridization pattern of the tags on a VLSIPS™ array provides verification
that the currency bill is genuine.

In certain embodiments, multiple probes bind to unique regions on a single tag. In these embodiments, the probes
are typically relatively large, e.g., about 60 nucleotides or longer. Probes are selected such that each probe binds to
a single region on a single tag. The use of tags which bind multiple probes increases the informational content of
hybridization reactions by making it possible to monitor multiple hybridization events simultaneously.

One of skill will also appreciate that it is not necessary to hybridize a tag directly to a probe array to achieve
essentially the same effect. For instance, tag nucleic acids are optionally (and preferably) amplified, e.g., using PCR
or LCR or other known amplification techniques, and the amplification products ("amplicons") hybridized to the array.
For instance, a nucleic acid tag optionally includes or is in proximity to PCR primer binding sites which, when amplified
using standard PCR techniques, amplify the tag nucleic acid, or a subsequence thereof. Thus, cells, or other tagged
items, can be detected even if the tag nucleic acids are present in very small quantities. One of skill will appreciate that
a single molecule of a nucleic acid tag can easily be detected after amplification, e.g., by PCR. The complexity reduction
from amplifying a selected mixture of tags (i.e., there are relatively few amplicon nucleic acid species as compared to
a pool of genomic DNA) facilitates analysis of the mixture of tags.

In one preferred embodiment, tags are selected such that each selected tag has a complementary selected tag.
For example, if a tag is cloned into an organism, the tag can be amplified using LCR, PCR or other amplification
methods. The amplified tag is often double-stranded. In preferred embodiments, tag sets which include complementary
sets of tags have corresponding probes for each complementary tag. Both strands of a double-stranded tag
amplification product are separately monitored by the probe array. Hybridization of each of the strands of the
double-stranded tag provides an independent readout for the presence or absence of the tag nucleic acid in a sample.
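
By way of illustration, the probes for the two strands of a double-stranded tag amplification product can be derived by
reverse complementation. The Python sketch below is a minimal example; the function and key names are hypothetical,
and sequences are assumed to be written 5' to 3' using only A, C, G and T.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]

def probes_for_amplicon(tag):
    """Probe sequences that detect each strand of a double-stranded tag amplicon."""
    return {
        # The tag strand is detected by a probe complementary to the tag.
        "tag_strand": reverse_complement(tag),
        # The complementary strand is detected by a probe with the tag's own sequence.
        "complement_strand": tag,
    }
```

Signal at both probes then provides the two independent readouts described above.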

Selection of Tag nucleic acids.

This invention provides ways of selecting nucleic acid tag sets useful for labeling cells and other compositions as
described above. The tag sets provided by the selection methods of the invention have uniform hybridization
characteristics (i.e., similar thermal binding stability to complementary nucleic acids), making the tag sets suitable for
detection by VLSIPS™ and other probe arrays, such as Southern or northern blots. Because the hybridization
characteristics of the tags are uniform, all of the tags in the set are typically detectable using a single set of
hybridization and wash conditions. As described in the Examples below, various selection methods of the invention were
used to generate lists of about 10,000 suitable 20-mer nucleic acid tags from all possible 20-mer sequences (about
1,200,000,000,000). The synthesis of a single array with 10,000 probes complementary to the 10,000 tag nucleic acids
(i.e., for the detection of the tags) was carried out using standard VLSIPS™ techniques to make a VLSIPS™ array.

Desirable nucleic acid tag sets have several properties. These include, inter alia, that the hybridization of the tags
to their complementary probe (i.e., in the VLSIPS™ array) is strong and uniform; that individual tags hybridize only to
their complementary probes, and do not significantly cross hybridize with probes complementary to other sequence
tags; and that, if there are constant regions associated with the tags (e.g., cloning sites, or PCR primer binding sites),
the constant regions do not hybridize to a corresponding probe set. If the selected tag set has the described properties,
any mixture of tags can be hybridized to a corresponding array, and the absence or presence of the tag can be
unambiguously determined and quantitated. Another advantage to such a tag set is that the amount of binding of any tag
set can be quantitated, providing an indication of the relative ratio of any particular tag nucleic acid to any other tag
nucleic acid in the tag set.

The properties outlined above are obtained by following some or all of the selection steps outlined below for
selection of tag sequence characteristics.

(1) Determine all possible nucleic acid tags of a selected length, or with selected hybridization properties. Although
the examples below provide ways of selecting tags from pools of tags of a single length for illustrative purposes, one
of skill will appreciate that the tags can have different lengths, e.g., where the tags have the same (or closely similar)
melting temperatures against perfectly complementary targets. One of skill will also appreciate that a subset of all
possible tags can be used, depending on the application. For instance, where tags are used to detect an organism,
20-mers which either occur in the organism's genome, or do not occur in the organism's genome, can be used as a
starting point for a pool of potential nucleic acid tags. For instance, the entire genome of S. cerevisiae is available. In
certain embodiments, endogenous sequences (e.g., those which appear in ORFs) can be used as tags, obviating the
need for cloning tags into the organism. Thus, all possible 20-mers from the genome are determined from the genomic
sequence, and 20-mers with desired hybridization characteristics to probes are selected. Alternatively, where tags are
cloned into the organism, it is occasionally preferable to eliminate all 20-mers which appear naturally in the genome
from consideration as tag sequences, so that introduced tag sequences are not confused with endogenous sequences
in hybridization assays.

The selection of the length of the nucleic acid tag is dependent on the desired hybridization and discrimination
properties of the probe array for detecting the tag. In general, the longer the tag, the higher the stringency of the
hybridization and washes of the hybridized nucleic acids on the array. However, longer tags are not as easily
discriminated on the array, because a single mismatch on a long nucleic acid duplex has less of a destabilizing effect on
hybridization than a single mismatch on a short nucleic acid duplex. It is expected that one of skill is thoroughly familiar
with the theory and practice of nucleic acid hybridization to an array of nucleic acids. In addition to the patents and
literature cited supra in regards to the synthesis of VLSIPS™ arrays, Gait, ed. Oligonucleotide Synthesis: A Practical
Approach, IRL Press, Oxford (1984); W.H.A. Kuijpers Nucleic Acids Research 18(17), 5197 (1994); K.L. Dueholm J.
Org. Chem. 59, 5767-5773 (1994); S. Agrawal (ed.) Methods in Molecular Biology, volume 20; and Tijssen (1993)
Laboratory Techniques in Biochemistry and Molecular Biology: Hybridization with Nucleic Acid Probes, e.g., part I
chapter 2 "Overview of principles of hybridization and the strategy of nucleic acid probe assays", Elsevier, New York,
provide a basic guide to nucleic acid hybridization.

Most typically, tags are between 8 and 100 nucleotides in length, and preferably between about 10 and 30
nucleotides in length. Most preferably, the tags are between 15 and 25 nucleotides in length. For example, in one
preferred embodiment, the nucleic acid tags are about 20 nucleotides in length.

(2) The tags are selected so that there is no complementarity between any of the probes in an array selected to
hybridize to the tag set and any constant tag region (constant tag regions are optionally provided to provide primer
binding sites, e.g., for PCR amplification of the remainder of the tag, or to limit the selected tags as described below).
In other words, the complementary nucleic acid of the variable region of a nucleic acid tag cannot hybridize to any
constant region of the nucleic acid tag. One of skill will appreciate that constant regions in tag sequences are optional,
typically being used where a PCR or other primer binding site is used within the tag.

(3) The tags are selected so that no tag hybridizes to a probe with only one sequence mismatch (all tags differ by
at least two nucleotides). Optionally, tags can be selected which have at least 2 mismatches, 3 mismatches, 4
mismatches, 5 mismatches or more to a probe which is not perfectly complementary to the tag, depending on the
application. Typically, all tag sequences are selected to hybridize only to a perfectly complementary probe, and the
nearest mismatch hybridization possibility has at least two hybridization mismatches. Thus, the tag sequences typically
differ by at least two nucleotides when aligned for maximal correspondence. Preferably, the tags differ by about 5
nucleotides when aligned for maximal correspondence (e.g., where the tags are 20-mers).

The tags are often selected so that they do not have identical runs of nucleotides of a specified length. For instance, 
where the tags are 20-mers, the tags are preferably selected so that no two tags have runs of about 9 or more
nucleotides in common. One of skill will appreciate that the length of prohibited identity varies depending on the selected
length of the tag. It was empirically determined that cross-hybridization occurs in tag sets when 20-mer tags have more 
than about 8 contiguous nucleotides in common. 

(4) The tags are selected so that there is no secondary structure within the complementary probes used to detect
the tags. Typically, this is done by eliminating tags from a selected tag set which have subsequences of 4 or more
nucleotides which are complementary.

(5) The tags are selected so that no secondary structure forms between a tag and any associated constant
sequence. Self-complementary tags have poor hybridization properties in arrays, because the complementary portions
of the probes (and corresponding tags) self hybridize (i.e., form hairpin structures).

(6) The tags are selected so that probes complementary to the tags do not hybridize to each other, thereby
preventing duplex formation of the tags in solution.

(7) If there is more than one constant region in the tag, the constant regions of the tag are selected so that they
do not self-hybridize or form hairpin structures.

(8) Where the tags are of a single length, the tags are selected so that they have roughly the same, and preferably
exactly the same, overall base composition (i.e., the same A+T to G+C ratio of nucleic acids). Where the tags are of
differing lengths, the A+T to G+C ratio is determined by selecting a thermal melting temperature for the tags, and
selecting an A+T to G+C ratio and probe length for each tag which has the selected thermal melting temperature.
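
Several of the steps above lend themselves to simple filters. The Python sketch below illustrates steps (1), (3) and (8)
under stated assumptions: sequences are written 5' to 3' using only A, C, G and T; "mismatches" are counted without gaps
(a simplification of alignment for maximal correspondence); and both genome strands are screened when excluding
endogenous sequences. All function names and default values are hypothetical and do not reproduce the programs of the
Examples.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]

def kmers(seq, k):
    """All k-base subsequences of seq, as a set (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def exclude_endogenous(candidates, genome, k):
    """Step (1): drop candidate tags that occur on either strand of the genome."""
    present = kmers(genome, k) | kmers(reverse_complement(genome), k)
    return [c for c in candidates if c not in present]

def mismatches(a, b):
    """Step (3): ungapped mismatch count between two tags of equal length."""
    if len(a) != len(b):
        raise ValueError("tags must have equal length")
    return sum(x != y for x, y in zip(a, b))

def shares_run(a, b, n=9):
    """Step (3): True if the two tags have a run of n identical bases in common."""
    return not kmers(a, n).isdisjoint(kmers(b, n))

def gc_count(tag):
    """Step (8): number of G and C residues in the tag."""
    return sum(base in "GC" for base in tag)

def compatible(a, b, min_mismatches=2, max_shared_run=8):
    """True if two tags satisfy the mismatch, shared-run and base-composition rules."""
    return (mismatches(a, b) >= min_mismatches
            and not shares_run(a, b, max_shared_run + 1)
            and gc_count(a) == gc_count(b))
```

The defaults reflect the values discussed for 20-mers (at least two mismatches, no more than about 8 contiguous
nucleotides in common); other tag lengths would call for other settings.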

One of skill will recognize that there are a variety of possible ways of performing the above selection steps. Most 
typically, selection steps are performed using simple computer programs to perform the selection in each of the steps 
outlined above; however, all of the steps are optionally performed manually. The following strategies are provided for 
exemplary purposes; one of skill will recognize that a variety of similar strategies can be used to achieve similar results. 

In one embodiment, secondary structure within the tag, and hybridization within or between pairs of complementary
probes, was prevented (goals 4, 5, and 6 above) by analyzing 4 base subsequences within the tags. Tags were
dynamically excluded when any of the following properties were met:

(a) All tags with complementary regions of 4 or more bases were excluded, including those which overlap in
sequence, and 4-mers which are self-complementary. To prevent tag variable sequences from hybridizing to the constant
primer sequences, regions of 4 or more bases in a tag which are complementary to 4 base sequences fully or partially
contained in the constant sequence, and which were separated by at least 3 bases (i.e., the minimal separation typically
required for hairpin formation), were prohibited.

(b) To assure uniformity of hybridization strength, runs of 4-mers comprised only of 4 As, 4 Ts or 4 G or C residues
were prohibited. Excluding runs of T/A and G/A is also desirable.
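
Rules (a) and (b) can be illustrated with the following Python sketch. It is a simplified reading of the rules and makes
several assumptions: only the tag itself is examined (the comparison against a constant sequence and the 3-base
separation test are not modeled), and rule (b) is read as forbidding AAAA, TTTT, and any 4-base window made up only of
G and C. Function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]

def has_complementary_4mers(tag, k=4):
    """Rule (a): True if some k-mer in the tag has its reverse complement in the tag.

    Overlapping regions and self-complementary k-mers (e.g., ACGT) are included,
    because a k-mer is compared against every k-mer of the tag, itself included.
    """
    found = {tag[i:i + k] for i in range(len(tag) - k + 1)}
    return any(reverse_complement(kmer) in found for kmer in found)

def has_forbidden_run(tag, k=4):
    """Rule (b): True if any k-base window is all A, all T, or only G and C."""
    for i in range(len(tag) - k + 1):
        window = tag[i:i + k]
        if window in ("A" * k, "T" * k) or set(window) <= {"G", "C"}:
            return True
    return False

def passes_4mer_rules(tag):
    return not has_complementary_4mers(tag) and not has_forbidden_run(tag)
```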

Further selection is optionally performed to refine aspects of the selection steps outlined above. For instance, to
select tags which are less likely to cross-hybridize, more fixed or restricted bases can be added for alignment purposes,
the tags can be lengthened, and additional coding requirements can be imposed. In one embodiment, the selection
method above is performed, and a subset of tags with reduced cross-hybridization is selected. For example, a first
tag from the set generated above is selected, and a second tag is selected from the tag set. If the second tag does
not cross-hybridize with the first tag, the second tag remains in the tag set. If it does cross-hybridize, the tag is
discarded. Thus, each tag from the group selected by the methods outlined above is compared to every other tag in the
group, and selected or discarded based upon comparison of hybridization properties. This process of comparison of one
tag to every other possible tag in a pool of tags is referred to as pairwise comparison. Similar to the steps outlined
above, cross-hybridization can be determined in a dynamic programming method as used for the sequence alignment above.
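
The pairwise comparison just described can be sketched as a greedy filter. In the Python example below the test for
cross-hybridization is passed in as a function, since the actual criterion (e.g., a dynamic programming alignment) is
not reproduced here; the stand-in criterion shown, fewer than two ungapped mismatches, is an assumption chosen only to
keep the example short.

```python
def greedy_select(tags, cross_hybridizes):
    """Keep each tag, in order, only if it does not cross-hybridize with a tag already kept."""
    kept = []
    for tag in tags:
        if not any(cross_hybridizes(tag, other) for other in kept):
            kept.append(tag)
    return kept

def fewer_than_two_mismatches(a, b):
    """Illustrative stand-in criterion: equal-length tags differing at fewer than 2 positions."""
    return sum(x != y for x, y in zip(a, b)) < 2
```

Because tags are accepted in the order presented, evaluating the same pool in a different order can yield a different
final set.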

Refinements of the above methods include accounting for the differences in destabilization caused by positional
effects of mismatches in the probe:tag duplex. The total number of mismatches is not the best estimate of hybridization
potential because the amount of destabilization is highly dependent on both the positions and the types of the
mismatches; for example, two adjacent mismatched bases in a 20 nucleotide duplex are generally less destabilizing than
two mismatches spread at equal intervals. A better estimate of cross-hybridization potential can be obtained by
comparing the two tags directly using dynamic programming or other methods. In these embodiments, a set of tag
sequences obeying the following rules (in which the presence of a constant region in the tags is optional) is generated:

(A) All tags are the same length, N, and have similar base composition. Certain runs of bases and potential hairpin
structures are prohibited (see above). 

(B) No two tag sequences contain an identical subsequence of length n, for some threshold length n. The second
rule allows fast screening of the majority of cross-hybridizing probes (the selection method is linear), leaving a
short list for which each pair of probes is compared for similarity (this takes time proportional to the square of the
number of probes). The method is essentially an alphabetical tree search with the addition of an array to keep
track of which n-mers have been used in previously generated tags. Each time the addition of a base to the growing
tag creates an n-mer already used in a previous tag, the method backtracks and tries the next value of the base.
(C) In this step, pairs of tags are compared to each other using a more sophisticated hybridization energy rule.
For each pair of tags, the energy of hybridization of each tag to the complement of the other is calculated. If the
energy exceeds a certain threshold, one of the tags is removed from the list. Probes are removed until there are
no pairs within the list exceeding the threshold. For instance, in one embodiment, the energy rule is as follows:
one point for a match (two adjacent matching base pairs), -2 points for errors in which a single base is bulged out
on one strand, and -3 for all other errors including long and asymmetric loops. The highest scoring alignment
between each pair of tags was determined using a dynamic programming algorithm with a preprocessor requiring
a short match of at least 5 bases to initiate comparison.
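
Rule (B) can be illustrated by a small backtracking search. The Python sketch below is a minimal reading of the
alphabetical tree search: tags are grown one base at a time in alphabetical order, a set records the n-mers used by
previously accepted tags, and any extension containing a used n-mer is abandoned. The prohibitions of rule (A) and the
energy comparison of rule (C) are not included, and the function name and parameters are hypothetical.

```python
def generate_tags(length, n, alphabet="ACGT", limit=None):
    """Alphabetical tree search for tags of `length` bases sharing no n-mer with one another."""
    used = set()   # n-mers claimed by previously accepted tags
    tags = []

    def contains_used_nmer(seq):
        # The whole prefix is re-checked so that, after a tag is accepted, the search
        # also backs out of prefixes whose n-mers that tag has just claimed.
        return any(seq[i:i + n] in used for i in range(len(seq) - n + 1))

    def extend(prefix):
        if len(prefix) == length:
            used.update(prefix[i:i + n] for i in range(length - n + 1))
            tags.append(prefix)
            return
        for base in alphabet:
            if limit is not None and len(tags) >= limit:
                return
            candidate = prefix + base
            if contains_used_nmer(candidate):
                continue  # backtrack: try the next value of the base
            extend(candidate)

    extend("")
    return tags
```

For example, `generate_tags(4, 2)` begins with AAAA and ACAC. In practice homopolymer tags such as AAAA would be removed
by rule (A).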

More sophisticated energy rules can be incorporated in this framework by refining the hybridization rules. In
addition, more complex rules for calculating the hybridization energy can be used in any of the above procedures. See,
Vesnaver et al. (1989) Proc. Natl. Acad. Sci. USA 86, 3614-3618; Wetmur (1991) Critical Reviews in Biochemistry and
Molecular Biology 26(3/4), 227-259; and Breslauer et al. (1986) Proc. Natl. Acad. Sci. USA 83, 3746-3750.

The set of tags chosen using pairwise comparison is not unique. It depends on the order of evaluation of the tags.
For instance, if the tags are evaluated in alphabetical order, the final list contains more tags beginning with A than with
T. A more sophisticated approach to generating the largest maximal set of such tags can also be used involving de Bruijn
sequences (sequences in which every n-mer for some length n occurs exactly once). For example, a de Bruijn sequence
incorporating all n-mers could be divided into 20-mers with overlaps of n-1, giving the maximal number of 20-mers
which do not have an n-mer in common. This procedure is modified to take into account the other goals for tags outlined
above. For example, runs of bases are typically removed from the original de Bruijn sequence, and 20-mers with
imbalanced base composition or palindromes (which occur on a length scale greater than n) are deleted in a
post-processing step.
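
The de Bruijn approach can be sketched as follows. The Python example generates a de Bruijn sequence with the standard
recursive Lyndon-word construction (a well-known algorithm, not one taken from the specification) and cuts it into
windows that overlap by n-1 bases. The cyclic sequence is treated as linear for simplicity, and the post-processing
steps described above are omitted; function names are hypothetical.

```python
def de_bruijn(alphabet, n):
    """Cyclic de Bruijn sequence in which every n-mer over `alphabet` occurs exactly once."""
    k = len(alphabet)
    a = [0] * (k * n)
    sequence = []

    def db(t, p):
        if t > n:
            if n % p == 0:
                sequence.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, k):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return "".join(alphabet[i] for i in sequence)

def split_with_overlap(seq, length, n):
    """Cut seq into windows of `length` bases, consecutive windows overlapping by n-1 bases."""
    step = length - (n - 1)
    return [seq[i:i + length] for i in range(0, len(seq) - length + 1, step)]
```

Because consecutive windows share only n-1 bases, no n-mer of the underlying sequence falls in two windows. For
20-mers, `split_with_overlap(de_bruijn("ACGT", n), 20, n)` would be used with the chosen threshold n.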

Many alternative procedures for tag selection and exclusion are possible. For example, all pairwise energies can
be computed before discarding any potential tags, and the tag which has the most near-matches exceeding the energy 
threshold can be discarded. The remaining tag with the most near-matches (not including those to tags which have 
already been discarded) can be discarded. This process is repeated until there are no near-matches remaining. 
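
This alternative can be sketched in Python as below. The near-match test is supplied by the caller and is assumed to be
symmetric; the tags are assumed to be distinct strings. The function name is hypothetical, and ties between equally
conflicted tags are broken by list order, which is an arbitrary choice.

```python
def prune_most_conflicted(tags, near_match):
    """Repeatedly discard the tag with the most remaining near-matches until none remain."""
    remaining = list(tags)
    # All pairwise comparisons are computed before any tag is discarded.
    conflicts = {t: {u for u in remaining if u != t and near_match(t, u)}
                 for t in remaining}
    while remaining:
        worst = max(remaining, key=lambda t: len(conflicts[t]))
        if not conflicts[worst]:
            break  # no near-matches remain
        remaining.remove(worst)
        for other in conflicts.pop(worst):
            conflicts[other].discard(worst)  # discarded tags no longer count
    return remaining
```

With the pool AAAA, AAAT, AATA, CCCC and a near-match defined as fewer than two mismatches, this procedure discards only
AAAA and keeps three tags, whereas in-order pairwise selection would keep only AAAA and CCCC.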

For example, in one preferred embodiment of the above selection methods, the tags lack a constant region. Tags 
are selected by selecting all possible n-mers (e.g., 20-mers), and eliminating sequences which:

(i) have runs of 4X where X is an A, T, C, or G (e.g., AAAA);

(ii) have secondary structure in which a run of 4 contiguous nucleotides have a complementary matching run of 4 
nucleotides within the tag; or 

(iii) have a 9 base subsequence (or other selected number of subsequences, typically from 5 to 15) in common 
with any other tag. 

All tags are then selected to have the same GC content, thereby providing all tags with similar melting temperatures
when bound to a complementary probe. Performing the above steps limits the number of tags in the tag set to a pool
of about 50,000 potential tags where the tags are 20-mers.

A pair-wise selection strategy is then performed to yield a final set of tags. In the pair-wise comparison, a first tag
is compared to every other tag in the tag set for hybridization to the complement of the first tag. If the first tag binds
to its target with a hybridization energy that is a selected threshold value higher than that of every other tag in the
potential set, it is kept. If another tag in the potential tag set binds to the complement of the first tag with a
hybridization energy above the selected threshold, the first tag is discarded. This process is repeated for every tag
remaining in the pool of potential tags. An exemplar computer program written in "C" is provided in Example 4
(Tags895.ccp) which performs the above selection steps. Varying the threshold resulted in sets of 0 to 50,000 tags. In
one preferred embodiment, 9,000 tags were generated.

More generally, tags (or probes complementary to the tags) are selected by eliminating tags which cross-hybridize
(bind to the same nucleic acid with a similar energy of hybridization). Tags bind complementary nucleic acids with a
similar energy of hybridization when a complementary nucleic acid to one tag binds to another tag with an energy
exceeding a specified threshold value. For example, where one tag is a perfect match to the probe, a second tag is
excluded if it binds the same probe with an energy of hybridization which is similar to the energy of hybridization of
the perfect match probe. Typically, if the second tag binds to the probe with about 80-95% or greater of the energy of a
perfectly complementary tag, or more typically about 90-95% or greater, or most typically about 95% or greater, the tag
is discarded from the tag set. The calculated energy can be based upon the stacking energy of various base pairs, the
energy cost for a loop in the hybridized probe-tag nucleic acid chain, and/or upon assigned values for hybridization of
base pairs, or on other specified hybridization parameters. In Example 2 below, tags were selected by eliminating
similar tags from a long list of possible tags, based upon hybridization properties such as assigned stacking values for
tag:probe hybrids.

Methods which do not involve pairwise comparison are also used. In one embodiment, the tags were selected to
comprise a constant portion and a variable portion. The variable portion of the tag sequence was limited to sequences
which comprise no more than 1 C residue. The constant region of the tag sequence was chosen to be 3'(ACTC)4CC.
This selection of specific sequences satisfies goal 7 above (the constant region was selected such that it is not
self-complementary). One of skill will recognize that other constant regions can be selected, for instance where a primer
binding site or restriction endonuclease site is incorporated into the tags.

Goal 2 is also met, because the probe (which is complementary to the variable portion of the tag) has no contiguous 
region of hybridizing bases with the exception of a single AGT or TGA sequence present on some of the probes, and 
even these sequences are not adjacent to the variable region of the tag where the primary hybridization occurs. In 
order to meet goals (1) and (8), the tags are selected to have the same length and the same total G+C content.

To prevent cross-hybridization between a tag and probes complementary to other tags, a set of tags which could
not be aligned with less than two errors was chosen, wherein an "error" is either a mismatch hybridization or an
overhanging nucleotide. This was done by fixing the sequence of the bases at the ends of the tags. In particular, the
bases at the ends of the tags were constrained to be the same: the bases were selected to start at the 5' end with the
residues GA, and to end at the 3' end with either an A or T residue, followed by a G residue. This arrangement prevents
tags from matching probes with a single overhanging nucleotide and no other errors, because an overhang forces either
a G-A or a G-T mismatch. This arrangement also prevents tags from matching with a single deletion error, because a
single deletion would cause the probe and the tag to be misaligned at one end, causing a mismatch. One of skill will
appreciate that this strategy can be modified in many ways to yield equivalent results, e.g., by selecting bases to yield
C-T or C-A mismatches.

To prevent single mismatch errors in the tag-probe hybridization, the next to last base from the 5' end was selected
so that the number of As plus the number of Gs in the variable region is even (as noted above, the next to last base
from the 5' end is either a T or an A). This base acts in a manner analogous to a parity bit in coding theory by requiring
that at least two differences exist between any two tags in the tag set. This is true because the GC content of all of the
selected tags is the same (see above); therefore, any base differences in the variable region have to involve the
substitution of G and C residues, or T and A residues. However, the substitution of fewer than two bases leads to an
odd number of G+A residues. Thus, at least two bases differ between any two tags in the tag set, satisfying (3) above.
Similarly, the strategy could be varied, e.g., by selecting a different residue in the tag as the parity base which assigns
whether the A+G content of the tag is even, or by adjusting the strategy above to yield an even number of T+G residues.
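The parity-base idea can be illustrated with a minimal sketch in C. The function names (count_ag, parity_base, hamming) are hypothetical and are not taken from tags.ccp, and the sketch assumes that the parity base itself is counted toward the A+G total. It shows that two variable regions differing by a single A/T substitution receive different parity bases, so the complete sequences differ at two positions.

```c
#include <stddef.h>

/* Count the A and G residues in the first n bases of seq. */
int count_ag(const char *seq, size_t n) {
    int count = 0;
    for (size_t i = 0; i < n; i++) {
        if (seq[i] == 'A' || seq[i] == 'G') count++;
    }
    return count;
}

/* Choose the parity base (A or T) so that the A+G total, including the
 * parity base itself (an assumption of this sketch), is even.
 * 'A' adds one to the total; 'T' adds nothing. */
char parity_base(const char *variable, size_t n) {
    return (count_ag(variable, n) % 2 == 0) ? 'T' : 'A';
}

/* Number of positions at which two sequences of length n differ. */
int hamming(const char *a, const char *b, size_t n) {
    int diff = 0;
    for (size_t i = 0; i < n; i++) {
        if (a[i] != b[i]) diff++;
    }
    return diff;
}
```

For instance, the variable regions ACGTA and ACGTT have the same G+C content and differ at one position; with their parity bases appended they become ACGTAA and ACGTTT, which differ at two positions.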

A computer program in the standard programming language "C" was written to perform each of the selection steps.
For completeness, the program is provided as Example 3 below, but it is expected that one of skill can design similar
programs, or perform the selection steps outlined above manually, to achieve essentially similar results. The program
tags.ccp uses a pruned tree search to find all sequences meeting the goals above, rather than testing every sequence of
a selected length for the desired sequence characteristics. While this provides an elegant selection program with few
processing steps, one of skill will recognize that other programs can be used which test every potential tag for a
desired sequence. The program tags.ccp selects tag sets depending upon a variety of parameters including the constant
sequence, the variable sequence, the GC content, the goal that the A+G total be even, and the relationship of the
constant and variable regions in the tags of the tag set. For instance, in one experiment, a constant and a variable
region were selected. The variable sequence was selected to be 15 nucleotides in length, the G+C content of the variable
region was selected to be 7 nucleotides, and the A+G total number of bases was selected to be even, with a pattern of
**N„[AT]*, where * is a selected fixed base. The parameters resulted in a set of about 8,000 tag sequences.
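A pruned tree search of the general kind described above can be sketched as follows. This is not the tags.ccp listing; the function name count_tags and its parameters are hypothetical, and the only constraint applied is the G+C total. The sketch extends a partial sequence one base at a time and abandons any branch that can no longer reach the required G+C count, rather than generating every sequence and testing it afterwards.

```c
/* Count the sequences of the given length that contain exactly gc_target
 * G or C residues. Call with pos = 0 and gc_so_far = 0. */
long count_tags(int pos, int length, int gc_so_far, int gc_target) {
    static const char bases[4] = {'A', 'C', 'G', 'T'};
    long found = 0;

    /* Prune: too many G/C residues already placed. */
    if (gc_so_far > gc_target) return 0;
    /* Prune: not enough positions left to reach the target. */
    if (gc_so_far + (length - pos) < gc_target) return 0;
    /* A complete sequence that survived both checks meets the target. */
    if (pos == length) return 1;

    for (int b = 0; b < 4; b++) {
        int is_gc = (bases[b] == 'C' || bases[b] == 'G');
        found += count_tags(pos + 1, length, gc_so_far + is_gc, gc_target);
    }
    return found;
}
```

The further goals described above (fixed end residues, the parity base, a limit on C residues) would be added as additional pruning tests at each step; they are omitted here to keep the sketch short.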

More generally, the problem of designing a set of tag sequences which do not cross-hybridize has strong similarities
to the problem of designing error-correcting codes in coding theory. The primary difference is that insertions and
deletions do not have a correlate in coding theory. This problem is accounted for as shown above by having constant
regions within the probe sequences. The strategy is generalized by changing the location of the parity bit, the required
parity, or the locations of the constant regions. More complex codes are also useful, for example codes which require
more differences between pairs of tags. See, Blahut (1983) Theory and Practice of Error Control Codes,
Addison-Wesley Publishing Company, Menlo Park, CA.

One of skill will also recognize that pairwise comparison methods are used in conjunction with any other selection
method. For instance, the tags generated according to specified rules such as those implemented by tags.ccp can be
further selected using any of the pairwise comparison methods described herein.

Synthesis of Oligonucleotide Arrays

Oligonucleotide arrays are selected to have oligonucleotides complementary to the tag nucleic acids described
above. The synthesis of oligonucleotide arrays generally is known. The development of very large scale immobilized
polymer synthesis (VLSIPS™) technology provides methods for arranging large numbers of oligonucleotide probes in
very small arrays. Pirrung et al., U.S. Patent No. 5,143,854 (see also PCT Application No. WO 90/15070), McGall et
al., U.S. Patent No. 5,412,087, Chee et al., SN PCT/US94/12305, and Fodor et al., PCT Publication No. WO 92/10092
describe methods of forming vast arrays of oligonucleotides using, for example, light-directed synthesis techniques.
See also, Fodor et al. (1991) Science 251: 767-777; Lipshutz et al. (1995) BioTechniques 19(3): 442-447; Fodor et al.
(1993) Nature 364: 555-556; and Medlin (1995) Environmental Health Perspectives 244-246.

As described above, diverse methods of making oligonucleotide arrays are known; accordingly no attempt is made
to describe or catalogue all known methods. For exemplary purposes, light directed VLSIPS™ methods are briefly
described below. One of skill will understand that alternate methods of creating oligonucleotide arrays, such as spotting
and/or flowing reagents over defined regions of a solid substrate, bead based methods and pin-based methods are
also known and applicable to the present invention (see, e.g., US Pat. No. 5,384,261, incorporated herein by reference
for all purposes). In the methods disclosed in these applications, reagents are typically delivered to the substrate by
flowing or spotting polymer synthesis reagents on predefined regions of the solid substrate.

Light directed VLSIPS™ methods are found, e.g., in U.S. Patent No. 5,143,854 and No. 5,412,087. The light
directed methods discussed in the '854 patent typically proceed by activating predefined regions of a substrate or solid
support and then contacting the substrate with a preselected monomer solution. The predefined regions are activated
with a light source, typically shone through a photolithographic mask. Other regions of the substrate remain inactive
because they are blocked by the mask from illumination. Thus, a light pattern defines which regions of the substrate
react with a given monomer. By repeatedly activating different sets of predefined regions and contacting different
monomer solutions with the substrate, a diverse array of oligonucleotides is produced on the substrate. Other steps,
such as washing unreacted monomer solution from the substrate, are used as necessary.

The surface of a solid support is typically modified with linking groups having photolabile protecting groups (e.g.,
NVOC or MeNPoc) and illuminated through a photolithographic mask, yielding reactive groups (e.g., typically hydroxyl
groups) in the illuminated regions. For instance, during oligonucleotide synthesis, a 3'-O-phosphoramidite (or other
nucleic acid synthesis reagent) activated deoxynucleoside (protected at the 5'-hydroxyl with a photolabile group) is
presented to the surface and coupling occurs at sites that were exposed to light in the previous step. Following capping
and oxidation, the substrate is rinsed and the surface illuminated through a second mask, to expose additional hydroxyl
groups for coupling. A second 5'-protected, 3'-O-phosphoramidite activated deoxynucleoside (or other oligonucleotide
monomer as appropriate) is then presented to the resulting array. The selective photodeprotection and coupling cycles
are repeated until the desired set of oligonucleotides is produced.
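The way in which a sequence of masks and monomers builds different oligonucleotides at different sites can be illustrated with a small simulation in C. This is only a bookkeeping sketch under simplifying assumptions: the names SITES, MAX_LEN and couple are hypothetical, each site is a text string, every illuminated site is assumed to couple with complete efficiency, and the chemistry (protecting groups, capping, oxidation, strand polarity) is not modeled.

```c
#include <string.h>

#define SITES 4     /* number of synthesis sites in this toy array */
#define MAX_LEN 8   /* longest probe the toy array can hold */

/* One synthesis cycle: each site whose mask entry is 1 is treated as
 * illuminated and couples the given monomer; sites whose mask entry is 0
 * stay protected and are left unchanged. */
void couple(char probes[SITES][MAX_LEN + 1], const int mask[SITES], char monomer) {
    for (int s = 0; s < SITES; s++) {
        size_t len = strlen(probes[s]);
        if (mask[s] && len < MAX_LEN) {
            probes[s][len] = monomer;
            probes[s][len + 1] = '\0';
        }
    }
}
```

Starting from four empty sites, the four cycles (mask 1100, A), (mask 0011, C), (mask 1010, G) and (mask 0101, T) yield the four different dimers AG, AT, CG and CT, one per site.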

In addition to VLSIPS™ arrays, other probe arrays can also be made. For instance, standard Southern or northern
blotting technology can be used to fix nucleic acid probes to various substrates such as papers, nitrocellulose, nylon
and the like. Because the formation of large arrays using standard technologies is difficult, VLSIPS™ arrays are
preferred.

Making Tag Nucleic Acids and Oligonucleotides to be Coupled Into Arrays; Synthesis of Test Nucleic Acids;
Cloning of Tag Nucleic Acids Into Cells

As described above, several methods for the synthesis of oligonucleotide arrays are known. In preferred
embodiments, the oligonucleotides are synthesized directly on a solid surface as described above. However, in certain
embodiments, it is useful to synthesize the oligonucleotides and then couple the oligonucleotides to the solid substrate
to form the desired array. Similarly, nucleic acids in general (e.g., tag nucleic acids) can be synthesized on a solid
substrate and then cleaved from the substrate, or they can be synthesized in solution (using chemical or enzymatic
procedures), or they can be naturally occurring (i.e., present in a biological sample).

Molecular cloning and expression techniques for making biological and synthetic oligonucleotides and nucleic
acids are known in the art. A wide variety of cloning and expression and in vitro amplification methods suitable for the
construction of nucleic acids are well-known to persons of skill. Examples of techniques and instructions sufficient to
direct persons of skill through many cloning exercises for the expression and purification of biological nucleic acids
(DNA and RNA) are found in Berger and Kimmel, Guide to Molecular Cloning Techniques, Methods in Enzymology
volume 152, Academic Press, Inc., San Diego, CA (Berger); Sambrook et al. (1989) Molecular Cloning - A Laboratory
Manual (2nd ed.) Vol. 1-3, Cold Spring Harbor Laboratory, Cold Spring Harbor Press, NY (Sambrook); and Current
Protocols in Molecular Biology, F.M. Ausubel et al., eds., Current Protocols, a joint venture between Greene Publishing
Associates, Inc. and John Wiley & Sons, Inc., (1994 Supplement) (Ausubel). Nucleic acids such as tag nucleic acids
can be cloned into cells (thereby creating recombinant tagged cells) using standard cloning protocols such as those
described in Berger, Sambrook and Ausubel.

Examples of techniques sufficient to direct persons of skill through in vitro methods of nucleic acid synthesis and
amplification of tags and probes in solution, including enzymatic methods such as the polymerase chain reaction (PCR),
the ligase chain reaction (LCR), Qβ-replicase amplification (QBR), nucleic acid sequence based amplification (NASBA),
strand displacement amplification (SDA), the cycling probe amplification reaction (CPR), branched DNA (bDNA) and
other DNA and RNA polymerase mediated techniques are known. Examples of these and related techniques are found
in Berger, Sambrook, and Ausubel, as well as Mullis et al. (1987) U.S. Patent No. 4,683,202; PCR Protocols: A Guide
to Methods and Applications (Innis et al. eds) Academic Press Inc., San Diego, CA (1990) (Innis); Arnheim & Levinson
(October 1, 1990); WO 94/11383; Vooijs et al. (1993) Am. J. Hum. Genet. 52: 586-597; C&EN 36-47; The Journal Of
NIH Research (1991) 3, 81-94; Kwoh et al. (1989) Proc. Natl. Acad. Sci. USA 86, 1173; Guatelli et al. (1990) Proc.
Natl. Acad. Sci. USA 87, 1874; Lomell et al. (1989) J. Clin. Chem. 35, 1826; Landegren et al. (1988) Science 241,
1077-1080; Van Brunt (1990) Biotechnology 8, 291-294; Wu and Wallace (1989) Gene 4, 560; Sooknanan and Malek
(1995) Bio/Technology 13, 563-564; Walker et al. Proc. Natl. Acad. Sci. USA 89, 392-396; and Barringer et al. (1990)
Gene 89, 117. Improved methods of cloning in vitro amplified nucleic acids are described in Wallace et al., U.S. Pat.
No. 5,426,039. In one preferred embodiment, nucleic acid tags are amplified prior to hybridization with VLSIPS™ arrays
as described above. For instance, where tag nucleic acids are cloned into cells in a cellular library, the tags can be
amplified using PCR.

Standard solid phase synthesis of nucleic acids is also known. Oligonucleotide synthesis is optionally performed
on commercially available solid phase oligonucleotide synthesis machines (see, Needham-VanDevanter et al. (1984)
Nucleic Acids Res. 12: 6159-6168) or performed manually using the solid phase phosphoramidite triester method
described by Beaucage et al. (Beaucage et al. (1981) Tetrahedron Letts. 22(20): 1859-1862). Finally, as described
above, nucleic acids are optionally synthesized using VLSIPS™ methods in arrays, and optionally cleaved from the
array. The nucleic acids can then be optionally reattached to a solid substrate to form a second array where appropriate,
or used as tag nucleic acids where appropriate, or used as tag sequences for cloning into a cell.

Labels 

The term "label" refers to a composition detectable by spectroscopic, photochemical, biochemical,
immunochemical, or chemical means. For example, useful nucleic acid labels include 32P, 35S, fluorescent dyes,
electron-dense reagents, enzymes (e.g., as commonly used in an ELISA), biotin, digoxigenin, or haptens and proteins for
which antisera or monoclonal antibodies are available.

A wide variety of labels suitable for labeling nucleic acids and conjugation techniques are known and are reported
extensively in both the scientific and patent literature, and are generally applicable to the present invention for the
labeling of tag nucleic acids, or amplified tag nucleic acids for detection by the arrays of the invention. Suitable labels
include radionucleotides, enzymes, substrates, cofactors, inhibitors, fluorescent moieties, chemiluminescent moieties,
magnetic particles, and the like. Labeling agents optionally include, e.g., monoclonal antibodies, polyclonal antibodies,
proteins, or other polymers such as affinity matrices, carbohydrates or lipids. Detection of tag nucleic acids proceeds
by any known method, including immunoblotting, tracking of radioactive or bioluminescent markers, Southern blotting,
northern blotting, southwestern blotting, northwestern blotting, or other methods which track a molecule based upon
size, charge or affinity. The particular label or detectable group used and the particular assay are not critical aspects
of the invention. The detectable moiety can be any material having a detectable physical or chemical property. Such
detectable labels have been well-developed in the field of gels, columns, and solid substrates, and in general, labels
useful in such methods can be applied to the present invention. Thus, a label is any composition detectable by
spectroscopic, photochemical, biochemical, immunochemical, electrical, optical or chemical means. Useful labels in the
present invention include fluorescent dyes (e.g., fluorescein isothiocyanate, Texas red, rhodamine, and the like),
radiolabels (e.g., 3H, 125I, 35S, 14C, or 32P), enzymes (e.g., LacZ, CAT, horse radish peroxidase, alkaline phosphatase
and others, commonly used as detectable enzymes, either as marker gene products or in an ELISA), nucleic acid
intercalators (e.g., ethidium bromide) and colorimetric labels such as colloidal gold or colored glass or plastic (e.g.,
polystyrene, polypropylene, latex, etc.) beads.

The label is coupled directly or indirectly to the desired nucleic acid according to methods well known in the art.
As indicated above, a wide variety of labels are used, with the choice of label depending on the sensitivity required,
ease of conjugation of the compound, stability requirements, available instrumentation, and disposal provisions.
Non-radioactive labels are often attached by indirect means. Generally, a ligand molecule (e.g., biotin) is covalently bound
to a polymer. The ligand then binds to an anti-ligand (e.g., streptavidin) molecule which is either inherently detectable
or covalently bound to a signal system, such as a detectable enzyme, a fluorescent compound, or a chemiluminescent
compound. A number of ligands and anti-ligands can be used. Where a ligand has a natural anti-ligand, for example,
biotin, thyroxine, and cortisol, it can be used in conjunction with labeled anti-ligands. Alternatively, any haptenic or
antigenic compound can be used in combination with an antibody. Labels can also be conjugated directly to signal
generating compounds, e.g., by conjugation with an enzyme or fluorophore. Enzymes of interest as labels will primarily
be hydrolases, particularly phosphatases, esterases and glycosidases, or oxidoreductases, particularly peroxidases.
Fluorescent compounds include fluorescein and its derivatives, rhodamine and its derivatives, dansyl, umbelliferone,
etc. Chemiluminescent compounds include luciferin, and 2,3-dihydrophthalazinediones, e.g., luminol. Means of
detecting labels are well known to those of skill in the art. Thus, for example, where the label is a radioactive label, means
for detection include a scintillation counter or photographic film as in autoradiography. Where the label is a fluorescent
label, it may be detected by exciting the fluorochrome with the appropriate wavelength of light and detecting the resulting
fluorescence, e.g., by microscopy, visual inspection, via photographic film, by the use of electronic detectors such as
charge coupled devices (CCDs) or photomultipliers and the like. For detection in VLSIPS™ arrays, fluorescent labels
and detection techniques, particularly microscopy, are preferred. Similarly, enzymatic labels may be detected by
providing appropriate substrates for the enzyme and detecting the resulting reaction product. Finally, simple colorimetric
labels are often detected simply by observing the color associated with the label. Thus, in various dipstick assays,
conjugated gold often appears pink, while various conjugated beads appear the color of the bead.

Substrates 

As mentioned above, depending upon the assay, the tag nucleic acids, or probes complementary to tag nucleic
acids, can be bound to a solid surface. Many methods for immobilizing nucleic acids to a variety of solid surfaces are
known in the art. For instance, the solid surface is optionally paper, or a membrane (e.g., nitrocellulose), a microtiter
dish (e.g., PVC, polypropylene, or polystyrene), a test tube (glass or plastic), a dipstick (e.g., glass, PVC, polypropylene,
polystyrene, latex, and the like), a microcentrifuge tube, or a glass, silica, plastic, metallic or polymer bead or other
substrate as described herein. The desired component may be covalently bound, or noncovalently attached to the
substrate through nonspecific bonding.

A wide variety of organic and inorganic polymers, both natural and synthetic, may be employed as the material for
the solid surface. Illustrative polymers include polyethylene, polypropylene, poly(4-methylbutene), polystyrene,
polymethacrylate, poly(ethylene terephthalate), rayon, nylon, poly(vinyl butyrate), polyvinylidene difluoride (PVDF),
silicones, polyformaldehyde, cellulose, cellulose acetate, nitrocellulose, and the like. Other materials which are
appropriate depending on the assay include paper, glasses, ceramics, metals, metalloids, semiconductive materials, cements
and the like. In addition, substances that form gels, such as proteins (e.g., gelatins), lipopolysaccharides, silicates,
agarose and polyacrylamides can be used. Polymers which form several aqueous phases, such as dextrans,
polyalkylene glycols or surfactants, such as phospholipids, long chain (12-24 carbon atoms) alkyl ammonium salts and the
like are also suitable. Where the solid surface is porous, various pore sizes may be employed depending upon the
nature of the system.

In preparing the surface, a plurality of different materials are optionally employed, e.g., as laminates, to obtain
various properties. For example, protein coatings, such as gelatin, can be used to avoid non-specific binding, simplify
covalent conjugation, enhance signal detection or the like. If covalent bonding between a compound and the surface
is desired, the surface will usually be polyfunctional or be capable of being polyfunctionalized. Functional groups which
may be present on the surface and used for linking can include carboxylic acids, aldehydes, amino groups, cyano
groups, ethylene groups, hydroxyl groups, mercapto groups and the like. In addition to covalent bonding, various
methods for noncovalently binding an assay component can be used.

EXAMPLES 

The following examples are provided by way of illustration only and not by way of limitation. Those of skill will
readily recognize a variety of noncritical parameters which could be changed or modified to yield essentially similar
results.

Example 1: Parallel Analysis of Deletion Strains of S. cerevisiae

The complete sequence of the S. cerevisiae genome is known. In the process of sequencing the genome,
thousands of open reading frames, representing potential genes or gene fragments, were identified. The function of many
of these ORFs is unknown.

Gene disruption is a powerful tool for determining the function of unknown ORFs in yeast. Given the sequence of
an ORF, it is possible to generate a deletion strain using standard gene disruption techniques. The deletion strain is
then grown under a variety of selective conditions to identify a phenotype which reveals the function of the missing
ORF. However, individual analysis of thousands of deletion strains to assess a large number of selective conditions is
impractical.

To overcome this problem, individual ORF deletions were tagged with a distinguishing molecular tag. The deletion
specific tags were read by hybridization to a high density array of oligonucleotide probes comprising probe sets
complementary to each tag.

The molecular tagging strategy involves a four-step approach for generating tagged deletion strains that can be
pooled and analyzed in parallel through selective growth assays.

Individual deletion strains were generated using a PCR-targeting strategy (Baudin, Ozier-Kalogeropoulos et al.
1993 Nuc. Acids Res. 21(14): 3329-3330). ORF specific molecular tags were incorporated during the transformation
(Figure 2). Tagged deletion strains were pooled and representative aliquots grown under different selective conditions.
The molecular tags were amplified from the surviving strains and hybridized to a high-density array containing
complements to the tag sequences (Figure 2). The array was then washed and scanned using a highly sensitive confocal
microscope. The normalized signal for each tag reflects the relative abundance of the different deletion strains in the
pool. The fitness of the deletion strains in the pool was determined by comparing the hybridization patterns obtained
before and after the selective growth.
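The before/after comparison can be illustrated with a short C sketch. The normalization used here (each tag's signal divided by the total signal on the array) and the function names normalized and abundance_ratio are assumptions made for illustration only; the text above does not specify how signals were normalized.

```c
#include <stddef.h>

/* Fraction of the total array signal contributed by tag i
 * (an assumed normalization, not one stated in the text). */
double normalized(const double *signal, size_t n, size_t i) {
    double total = 0.0;
    for (size_t k = 0; k < n; k++) total += signal[k];
    return total > 0.0 ? signal[i] / total : 0.0;
}

/* Ratio of the normalized signal of tag i after selective growth to its
 * normalized signal before selective growth. A ratio near 0 indicates
 * that the strain was depleted from the pool. */
double abundance_ratio(const double *before, const double *after, size_t n, size_t i) {
    double b = normalized(before, n, i);
    return b > 0.0 ? normalized(after, n, i) / b : 0.0;
}
```

With four tags that each give a signal of 100 before selection and signals of 150, 150, 0 and 100 afterwards, the third tag has a ratio of 0 (its strain was lost), while the first has a ratio of 1.5.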

To test the feasibility of the molecular tagging strategy, a list of 9,105 unique 20mer tag sequences was generated
using the computer program tags.ccp (see below and Table 1).



Table 1

                      Selection criteria              Properties of            Number of 20mers accepted
                                                      selected 20mers

All Possible 20mers   None                                                     1.2 x 10^12

Primary Filter        No hairpins > 4 bp              Similar Tm (+A7 8 C)     51,081
                      No single nucleotide            Good hybridization
                      runs > 4 bases
                      Similar base composition        No extreme homology
                      (4-6 of each base)
                      No common 9mers

Secondary Filter      Low stringency                  Unique                   51,081
                      pair-wise analysis                                       9,105 -> (4,500 selected at random)
                                                                               2,643
                                                                               853
                                                                               170
                      Highest stringency              Very unique              42
                      comparisons

A 1.28 cm x 1.28 cm array comprising probes complementary to the tag sequences was produced by standard
light-directed VLSIPS™ methods. The resulting high-density array of probes provides probe sets at known locations
in the array. Fluorescence imaging using a scanning confocal microscope permitted quantitation of the hybridization
signals for each of the 4,500 sets of 20mers on the array (Figure 1). Hybridization experiments with 120 different
fluorescently labeled 20mer oligonucleotides showed that the arrays are sensitive, quantitative, and highly specific.

As part of a feasibility study, tagged deletion strains were generated for eleven characterized auxotrophic yeast
genes (ADE1, ADE2, ADE3, ADE4, ADE5, AROA, ARO7, TRP2, TRP3, TRP4, and TRP5) using the strategy described
in Figure 2. The oligonucleotides used to generate the deletion strains are described in Figure 3 and transformation
results are shown in Figure 4.

The strains were pooled and grown in complete media and different drop-out media. Genomic DNA extracted
from the pool served as a template for an asymmetric tag amplification using a pair of primers homologous to common
regions flanking each tag (Figure 4). Depletion of specific strains from the pool was quantitatively measured by
hybridizing the amplified tags to the high-density arrays (Figure 6A-C).

Example 2: A Method for Selecting Tags From a Pool of Tags

Tags (or probes complementary to the tags) are selected by eliminating tags which bind to the same target with a
similar energy of hybridization. Tags bind complementary nucleic acids with a similar energy of hybridization when a
complementary nucleic acid to one tag binds to another tag with an energy exceeding a specified threshold value. The
calculated energy is based upon, e.g., the stacking energy of various base pairs, and the energy cost for a loop in the
chain, and/or upon assigned values for hybridization of base pairs, or on other specified hybridization parameters. In
this example, tags were selected by eliminating similar tags from a long list of possible tags, based upon hybridization
properties such as assigned stacking values for tag:probe hybrids.

Probecmp was written to create a list of non-similar tags from a long list of tags. Tags are considered to be similar
if a perfect match to one tag binds to another tag with an energy exceeding some specified threshold. Probecmp
incorporates three distinct ideas in selecting tags. The ideas are:

1) a stacking and loop cost model for the calculation of hybridization energy;

2) algorithms to quickly calculate that energy, including a recursive, heavily pruned algorithm, and a dynamic
programming algorithm; and

3) a hash table to quickly find perfect match segments.

The stacking and loop cost model for the calculation of hybridization energy

The calculated energy is based on the stacking energy of various base pairs, and the energy cost for a loop in the
chain. E.g., the user can specify that the energy from a TA stacking is 2, that from GC and CG is 4, and that from
AC, AG, TC, etc. is 3. With these values, AGGTACG is worth 3+4+3+2+3+4 = 19. The energy cost for loops is given by
a matrix of the loop size on each strand.
For example, if the following match occurs, there is a loop size of 1 on the first strand and 0 on the second strand.
Examination of the table reveals a loop penalty of 5, and a resulting stacking energy of 14.



target: AG GT A CG 
tag: T C C A T G C 
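The arithmetic of the stacking and loop cost model can be sketched in C as follows. The stacking rule used here (4 when both adjacent bases are G or C, 2 when both are A or T, 3 otherwise) is an assumption chosen so that the sketch reproduces the worked value of 19 for AGGTACG; in the program described above these values are supplied by the user. The loop-cost matrix is likewise hypothetical apart from the single entry of 5 for a loop of size 1 on the first strand and 0 on the second, and all function names are illustrative.

```c
#include <stddef.h>
#include <string.h>

int is_gc(char b) { return b == 'G' || b == 'C'; }

/* Assumed stacking value for two adjacent bases on one strand:
 * 4 if both are G/C, 2 if both are A/T, 3 if mixed. */
int stack_value(char first, char second) {
    int gc = is_gc(first) + is_gc(second);
    return gc == 2 ? 4 : (gc == 0 ? 2 : 3);
}

/* Sum of the stacking values over every adjacent pair of bases in seq. */
int stacking_energy(const char *seq) {
    int total = 0;
    size_t n = strlen(seq);
    for (size_t i = 0; i + 1 < n; i++) {
        total += stack_value(seq[i], seq[i + 1]);
    }
    return total;
}

/* Hypothetical loop-cost matrix indexed by [loop size on first strand]
 * [loop size on second strand]. Only the [1][0] entry (5) comes from the
 * text; the other entries are placeholders. Sizes above 1 return -1. */
int loop_cost(int first_strand, int second_strand) {
    static const int cost[2][2] = { {0, 5}, {5, 10} };
    if (first_strand < 0 || first_strand > 1) return -1;
    if (second_strand < 0 || second_strand > 1) return -1;
    return cost[first_strand][second_strand];
}
```

Under these assumptions stacking_energy("AGGTACG") is 19, and subtracting loop_cost(1, 0) gives the value of 14 quoted for the looped match.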



The Algorithms to Quickly Calculate the Energy of Hybridization

The hybridization energy can be calculated using either a recursive algorithm or a dynamic programming algorithm.
The recursive algorithm is fast if the energy cost for loops is large relative to the stacking energy. The dynamic
programming algorithm is fast if the energy cost for loops is small relative to the stacking energy.



CTA    G|AA   GATAAGTACC
AAACA  A|TT   CTAACAGG

| Perfect match region



Both energy calculation algorithms start out with 2 tags which have a multiple base perfect match sequence. They
then calculate the matching energy before the perfect match region, in the perfect match region, and after the perfect
match region. The total matching energy is the sum of these three matching energies. The same algorithm is used for
both the before match energy and the after match energy, by reversing the orders of the before match fragments.

A recursive, heavily pruned algorithm

The recursive algorithm tries all match trees, looking at all loop sizes that have a low enough cost that they can
be paid for with the maximum possible energy for the remaining matches. Code for the algorithm is as follows:



/* linkSet is a structure listing locations of matching bases, and the corresponding energy. */
static linkSet local * calculateEnergyAndLinksFromPointOn( short pp, short tp, char local * tag,
        char local * target, short basesLeftInTag, short basesLeftInTarget ){
    register short i, j;
    float tempEnergy, bestEnergy = 0;
    linkSet *en, *returnValue = NULL;
    short maxTagLoop;

    maxTagLoop = min( basesLeftInTag, maxLegalLoop + 1 ); // only check legal loop sizes
    for( i = 0; i < maxTagLoop; i++ ){ // i is loop size along tag
        // botherWithLoop is an array that has precalculated the minimum number of base pairs
        // needed to pay for that loop size.
        for( j = 0; min( basesLeftInTag - i, basesLeftInTarget - j ) > botherWithLoop[i][j]; j++ ){
            // j is loop size along target
            if( tag[pp + i + 1] & target[tp + j + 1] ){ // this tests that the tag matches the target
                en = calculateEnergyAndLinksFromPointOn( pp+i+1, tp+j+1, tag,
                        target, basesLeftInTag - i - 1, basesLeftInTarget - j - 1 );
                // if this is the last match, and the stacking energy pays for the loop, and is
                // better than our previous best....
                if( en == NULL && (tempEnergy =
                        stackingEnergy[ /* index and loop-cost terms illegible in source */ ]) > bestEnergy ){
                    returnValue = makeNewLinkSet( returnValue );
                    returnValue->energy = bestEnergy = tempEnergy;
                    if( i == 0 && j == 0 ){ // add self and successor as link
                        firstLink( returnValue, 2, pp, tp );
                    }else{ // add successor as link
                        firstLink( returnValue, 1, pp+i+1, tp+j+1 );
                        addNewLink( returnValue, 1, pp, tp );
                    }
                }else if( (i == 0) && (j == 0) ){ // new matching base pair is adjacent
                    // if the stacking energy pays for the loop, and is better than our
                    // previous best....
                    if( (tempEnergy = en->energy +
                            stackingEnergy[ /* index terms illegible in source */ ]) > bestEnergy ){
                        en->energy = bestEnergy = tempEnergy;
                        makeFirstLinkOneLonger( en );
                        if( returnValue != NULL ) farfree( returnValue );
                        returnValue = en;
                    }else // new energy too small, don't want it, just free it.
                        if( en != NULL ) farfree( en );
                }else{ // loop size is non-zero
                    // if the stacking energy pays for the loop, and is better than our
                    // previous best....
                    if( (tempEnergy = en->energy +
                            stackingEnergy[ /* index and loop-cost terms illegible in source */ ]) > bestEnergy ){
                        en->energy = bestEnergy = tempEnergy;
                        if( returnValue != NULL ) farfree( returnValue );
                        returnValue = en;
                        addNewLink( returnValue, 1, pp, tp );
                    }else // new energy too small, don't want it, just free it.
                        if( en != NULL ) farfree( en );
                }
            }
        }
    }
    return returnValue;
}

A dynamic programming algorithm 

The dynamic programming algorithm starts out by making a matrix of permitted or "legal" connections between 
the two fragments. 



CTAGAAGAT
AAACAATTCTAACAGG
AAGTACC

| Perfect match region

Then, starting at the upper left corner, each legal connection is considered, and all previous bases are identified.
Previous bases are any legal base pair in the rectangle to the upper left of the considered base. In the figure below,
the first three legal matches have no previous legal connection except the (assumed) perfect match which occurs
before the mismatch segment.
The values in those cells are replaced by the sum of the stacking energy and the loop cost.
Then the process is repeated for each legal connection further out in the matrix. If there is more than one legal loop,
the loop with the maximum value is used. When this process is finished, the path that resulted in the maximum cell value
is the best match.
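
The matrix-filling procedure described above can be sketched in C as follows. This is only an illustrative sketch under stated assumptions: the function names (is_legal, pair_gain, loop_cost, best_match), the bound MAXLEN, and above all the energy values are hypothetical placeholders, not the stacking-energy table or loop costs used by the program. The two fragments are assumed to be written so that position i of the first may pair with position j of the second, and a cell with no previous legal connection is scored as if it followed the assumed perfect match located just before the first position of each fragment.

```c
#include <string.h>

#define MAXLEN 32   /* hypothetical upper bound on fragment length */

/* Watson-Crick pairing test; a pair that passes is a "legal" connection. */
static int is_legal(char a, char b)
{
    return (a == 'A' && b == 'T') || (a == 'T' && b == 'A') ||
           (a == 'C' && b == 'G') || (a == 'G' && b == 'C');
}

/* Placeholder energies (assumptions, not the program's stacking table). */
static int pair_gain(char a) { return (a == 'G' || a == 'C') ? 3 : 2; }
static int loop_cost(int skipped_a, int skipped_b) { return skipped_a + skipped_b; }

/* Returns the maximum cell value, or 0 if there is no legal connection
   with a positive value or a fragment is longer than MAXLEN. */
int best_match(const char *a, const char *b)
{
    int legal[MAXLEN][MAXLEN];
    int value[MAXLEN][MAXLEN];
    int m = (int)strlen(a), n = (int)strlen(b);
    int i, j, pi, pj, best = 0;

    if (m > MAXLEN || n > MAXLEN) return 0;

    /* Row-major order guarantees that every cell to the upper left of
       (i, j) has already been filled in when (i, j) is considered. */
    for (i = 0; i < m; ++i) {
        for (j = 0; j < n; ++j) {
            int v;
            legal[i][j] = is_legal(a[i], b[j]);
            value[i][j] = 0;
            if (!legal[i][j]) continue;

            /* No previous connection: start from the assumed perfect match. */
            v = pair_gain(a[i]) - loop_cost(i, j);

            /* Otherwise take the best previous legal connection in the
               rectangle to the upper left, paying for the loop between. */
            for (pi = 0; pi < i; ++pi)
                for (pj = 0; pj < j; ++pj)
                    if (legal[pi][pj]) {
                        int cand = value[pi][pj] + pair_gain(a[i])
                                 - loop_cost(i - pi - 1, j - pj - 1);
                        if (cand > v) v = cand;
                    }

            value[i][j] = v;
            if (v > best) best = v;
        }
    }
    return best;
}
```

With these placeholder energies, best_match("ACGT", "TGCA") follows the diagonal with no loops, while best_match("ACGT", "TGACA") pays one unit for the single unpaired base in the second fragment. Recording which previous cell produced each maximum would recover the path itself rather than only its value.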
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A hash table to quickly find perfect match segments 

To speed up the comparison of one tag with all other tags in the list, the program uses a hash table which points
to all occurrences of any given n-mer in the set of tags. The hash table is implemented as two arrays of structures
which point to locations in tags. The first array is 4 to the n records long, and the second is the size of the complete
list of tags.
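
A two-array n-mer table of the kind described above might be laid out as in the following sketch. The structure and function names (Occurrence, NmerIndex, build_index, count_occurrences, free_index) and the choice n = 4 are assumptions made for illustration; the original data structures are not reproduced here. The first array has 4 to the n entries, one per possible n-mer, and the second array holds one record for each n-mer position found in the list of tags, chained together so that all occurrences of a given n-mer can be visited.

```c
#include <stdlib.h>
#include <string.h>

#define NMER 4            /* hypothetical n */
#define TABLE_SIZE 256    /* 4 to the NMER */

typedef struct {
    int tag;    /* index of the tag in the list */
    int pos;    /* offset of the n-mer within that tag */
    int next;   /* next occurrence of the same n-mer, or -1 */
} Occurrence;

typedef struct {
    int head[TABLE_SIZE];  /* first array: first occurrence per n-mer, or -1 */
    Occurrence *occ;       /* second array: one record per n-mer position */
    int count;
} NmerIndex;

/* Base-4 code of the n-mer starting at s, or -1 if fewer than NMER
   characters remain or a character is not A, C, G or T. */
static int nmer_code(const char *s)
{
    int i, code = 0;
    for (i = 0; i < NMER; ++i) {
        int b;
        switch (s[i]) {
            case 'A': b = 0; break;
            case 'C': b = 1; break;
            case 'G': b = 2; break;
            case 'T': b = 3; break;
            default: return -1;
        }
        code = code * 4 + b;
    }
    return code;
}

/* Builds the index over ntags tags; returns 0 on success, -1 on failure. */
int build_index(NmerIndex *ix, const char *tags[], int ntags)
{
    int t, p, total = 0;

    for (t = 0; t < ntags; ++t) {
        int len = (int)strlen(tags[t]);
        if (len >= NMER) total += len - NMER + 1;
    }
    ix->occ = malloc((size_t)(total > 0 ? total : 1) * sizeof *ix->occ);
    if (ix->occ == NULL) return -1;
    ix->count = 0;
    for (p = 0; p < TABLE_SIZE; ++p) ix->head[p] = -1;

    for (t = 0; t < ntags; ++t) {
        int len = (int)strlen(tags[t]);
        for (p = 0; p + NMER <= len; ++p) {
            int code = nmer_code(tags[t] + p);
            if (code < 0) continue;
            ix->occ[ix->count].tag = t;
            ix->occ[ix->count].pos = p;
            ix->occ[ix->count].next = ix->head[code];  /* push onto chain */
            ix->head[code] = ix->count++;
        }
    }
    return 0;
}

/* Number of places in the tag list where the given n-mer occurs. */
int count_occurrences(const NmerIndex *ix, const char *nmer)
{
    int code = nmer_code(nmer), k, n = 0;
    if (code < 0) return 0;
    for (k = ix->head[code]; k >= 0; k = ix->occ[k].next) ++n;
    return n;
}

void free_index(NmerIndex *ix)
{
    free(ix->occ);
    ix->occ = NULL;
    ix->count = 0;
}
```

Walking the chain for an n-mer yields the (tag, pos) locations at which a perfect match segment could be seeded, so that only tags sharing at least one n-mer need to be compared in detail.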
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to the complement of any other tag with a stacking energy which exceeds a specified threshold. 
Example 3: "tags.ccp" 

The computer program tags.ccp, written in "C" and referred to above, is provided below.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <alloc.h>

char out[] = "TAGS.OUT";
char hout[] = "HIST.OUT";
char label[] = "ACGT";

#define BASES 4 //number of nucleotides
//the following numbers include all the nonperiodic bases at the
//end of the primer (in this case one C) in addition to the tag
#define PRETAG 0 //bases of primer preceding tag
#define LENGTH 20 //length of tag plus pretag
#define GCLIM 11 //max total of G's and C's allowed
#define ATLIM 11 //max total of A's and T's allowed
#define AALIM 7 //max total of A's allowed
#define CCLIM 6 //max total of C's allowed
#define GGLIM 6 //max total of G's allowed
#define TTLIM 7 //max total of T's allowed
#define NUMERS 256 //number of fourmers
#define LMAT 10 //length of long matches prohibited
#define LNUM 1048576L //number of longmers
#define MXNUM ( 32768 / sizeof(int))

short numVectors;
int far ** lngVectors;

int test_base(int current_base);
int remove_base_data(int current_base);
int complement(int fourmer);
double seq_to_int(void);
int print_sequence(void);

int pattern[LENGTH][BASES]; //identifies allowed bases each position
int fourmers[NUMERS]; //identifies prohibited 4mers
int sequence[LENGTH]; //base at specified position
int fourseq[LENGTH]; //fourmers ending at specified position
long longseq[LENGTH]; //longmers ending at specified position
int basecnt[BASES]; //number of occurrences of each base to left of
                    //current_position
int spacings[LENGTH]; //position past (or =) to which an occurrence of
                      //a complementary fourmer to the one at this
                      //position is prohibited

FILE *outfile, *houtfile;
main()
{
    int i,j,k,cbs; //counters
    long temp,c_b; //temp storage of longmer
    int cseq[LENGTH] = {2,0,0,0,1,0,0,0,2,0,2,2,2,3,2}; //sample tag
    long xpas; //number of passes thru loop over 4
    long maxtag; //max number of different tags
    int current_base; //current sequence position within backtracking
    int tag_count; //count of acceptable tag sequences
    int pass; //flag for whether tag was acceptable
    int exflag; //flag indicating completion of all possible tags
    double cur_seq; //integer representation of current tag sequence
    double prev_seq; //integer representation of previously tested tag
    int histogram[LENGTH]; //histogram of mismatches to cseq
    int mismatches;

    numVectors = (LNUM/MXNUM);
    lngVectors = (int **) farcalloc(numVectors, sizeof(int far *));
    if(lngVectors == NULL) {
        printf("\nerror allocating lngVectors!");
        exit(1);
    }
    for(i=0; i<numVectors; ++i) {
        lngVectors[i] = (int *) farcalloc(MXNUM, sizeof(int));
        if(lngVectors[i] == NULL) {
            printf("\nerror allocating vectors: i = %d\n",i);
            exit(1);
        }
    }

    //open output file
    outfile = fopen(out,"wt");
    if(NULL == outfile){
        printf("\nerror opening output file %s\n",out);
        exit(1);
    }
    //open houtput file
    houtfile = fopen(hout,"wt");
    if(NULL == houtfile){
        printf("\nerror opening output file %s\n",hout);
        fclose(outfile);exit(1);
    }

    //initialize histogram
    for(i=0; i<LENGTH; ++i) histogram[i] = 0;
    //initialize pattern array
    for(i=0; i<LENGTH; ++i) for(k=0; k<BASES; ++k) pattern[i][k] = 1;
    //initialize 4mer array
    for(i=0; i<NUMERS; ++i) fourmers[i] = 1000;
    //initialize 9mer array
    for(i=0; i<LMAT; ++i) lngVectors[i/MXNUM][i%MXNUM] = 0;
    //Runs marked
    fourmers[0] = fourmers[85] = fourmers[170] = fourmers[255] = -1;
    //initialize sequence
    for(i=0; i<LENGTH; ++i) sequence[i] = 0;
    //initialize spacings to disregard incomplete words at beginning
    for(i=0; i<3; ++i) spacings[i] = 1000;
    //initialize spacings to require 6mer palindrome
    for(i=3; i<LENGTH; ++i) spacings[i] = i+2;
    //initialize base sequence, current position, base count, tag count etc.
    //start at "largest" sequence we know is wrong
    sequence[0] = -1;
    current_base = 0;
    basecnt[0] = basecnt[1] = basecnt[2] = basecnt[3] = 0;
    tag_count = 0;
    prev_seq = -1;
    //initialize fourseq for fourmers up to current_base
    //don't need to since it is at zero and it will be reset
    //initialize fourmers for fourmers up to current_base
    //don't need to do anything because only one is at zero and
    // it will be removed shortly anyway
    //calculate maximum number of different tags
    for(maxtag = 1,i=0; i < LENGTH-PRETAG-1; ++i) {
        maxtag *= BASES;
    }
    printf("\nmaxtag = %ld\n",maxtag);
    //THIS IS A DEBUG STATEMENT!! REMOVE LATER
    maxtag = (long) 10000000000;
    pass = 0;

//until all probes are exhausted 

    for(j=0, xpas=0,exflag=0; exflag != 1 && xpas<maxtag; ++j) {
        if(j==4){
            ++xpas;
            j=0;
        }
        //DEBUG
        //fprintf(outfile,"\nNOW: ");
        //print_sequence();

        //backtrack to last incrementable base
        for(; sequence[current_base] == BASES-1 || (pass == 1 && current_base>7);
              --current_base) {
            //remove current base data
            remove_base_data(current_base);
            //set base to zero
            sequence[current_base] = 0;
        }
        if(current_base < PRETAG) {
            exflag = 1;
            continue;
        }

        //remove current base data
        remove_base_data(current_base);
        //increment current_base
        ++sequence[current_base];
        //error checking to ensure sequence is increasing
        //DEBUG
        //fprintf(outfile,"\nUPD: ");
        //print_sequence();
        cur_seq = seq_to_int();
        if(cur_seq <= prev_seq) {
            print_sequence();
            printf("\n\n!!ERRROR: current_base = %d, cur seq = %.0f, prev seq = %.0f, j = %d\n",
                current_base,cur_seq,prev_seq,j);
            fclose(outfile);exit(1);
        }
        prev_seq = cur_seq;
        //update base data and test until reach end or failure
        for(pass = 1; current_base < LENGTH && pass == 1; ++current_base)
            pass = test_base(current_base);
        //if testing was successful, print sequence
        --current_base;

        //extra check to be sure not repeating longmers
        if(pass == 1) {
            //record prohibitions on 9mers
            for(cbs = LMAT-1; cbs < LENGTH; ++cbs) {
                c_b = longseq[cbs];
                if(lngVectors[c_b/MXNUM][c_b%MXNUM] != 0) {
                    printf("\n\n!!ERROR: already matching 9mer found: c_b = %ld, current_base = %d, longseq = %ld",
                        c_b, cbs, longseq[cbs]);
                    fclose(outfile);exit(1);
                }
            }
            //record longmer prohibitions
            for(cbs = LMAT-1; cbs < LENGTH; ++cbs) {
                c_b = longseq[cbs];
                lngVectors[c_b/MXNUM][c_b%MXNUM] = 1;
            }
            //increment tag count
            ++tag_count;
            //print tag sequence
            fprintf(outfile,"\n");
            print_sequence();
            /* //calculate mismatches with ref sequence, cseq
            mismatches = 0;
            for(i=0; i<LENGTH; ++i) if(cseq[i] != sequence[i]) ++mismatches;
            ++histogram[mismatches];
            if(mismatches <= 2) {
                fprintf(houtfile,"\n");
                for(i=PRETAG; i<LENGTH; ++i)
                    fprintf(houtfile,"%c",label[sequence[i]]);
            }
            */
            //fprintf(outfile," OK!");
        }
    }

    //if exited on j >= MAX record error and exit
    if(j>=maxtag)
        printf("\n!!ERROR: exceeded allowable passes through primary loop\n");
    //print tag count
    fprintf(outfile,"\n\nTag Count: %d\n",tag_count);
    for(i=0; i<LENGTH; ++i) fprintf(houtfile,"\n%d mismatches: %d\n",i,histogram[i]);
    fclose(outfile);
    fclose(houtfile);
    return(1);
}

int test_base(int current_base)
{
    int comp; //complement
    long c_b;

    //increment base count (must be accomplished before exiting this function)
    ++basecnt[sequence[current_base]];
    //fprintf(outfile,"\nbase = %d ",current_base);
    //test base count, return upon failure
    if(basecnt[0] + basecnt[3] > ATLIM) {
        //fprintf(outfile,"failed AT limit");
        return(0);
    }
    if(basecnt[1] + basecnt[2] > GCLIM) {
        //fprintf(outfile,"failed GC limit");
        return(0);
    }
    if(basecnt[0] > AALIM) {
        //fprintf(outfile,"failed A limit");
        return(0);
    }
    if(basecnt[1] > CCLIM) {
        //fprintf(outfile,"failed C limit");
        return(0);
    }
    if(basecnt[2] > GGLIM) {
        //fprintf(outfile,"failed G limit");
        return(0);
    }
    if(basecnt[3] > TTLIM) {
        //fprintf(outfile,"failed T limit");
        return(0);
    }
    //test if base matches patterns, return upon failure
    if(pattern[current_base][sequence[current_base]] != 1) {
        //fprintf(outfile,"failed pattern match");
        return(0);
    }
    //if at last base verify checksum
    if(current_base == LENGTH-1)
        if(0 == ((basecnt[0]+basecnt[2])%2)) {
            //fprintf(outfile,"failed checksum");
            return(0);
        }
    //compute current 4mer
    if(current_base == 0) fourseq[0] = sequence[current_base];
    else fourseq[current_base] = (4*fourseq[current_base-1])%256 + sequence[current_base];
    //if this is a full 4mer check 4mer and return upon failure
    if(current_base > 2) {
        comp = complement(fourseq[current_base]);
        if(fourmers[comp] <= current_base) {
            //fprintf(outfile,"fourmer failed: curt: %d, comp: %d, spacing: %d\n",
            //    fourseq[current_base],comp,fourmers[comp]);
            return(0);
        }
    }
    //compute current 9mer if full 9mer
    if(current_base == 0) longseq[0] = sequence[current_base];
    else longseq[current_base] = (4*longseq[current_base-1])%LNUM + sequence[current_base];
    //if full longmer check 9mer and return upon failure
    if(current_base > LMAT-2) {
        c_b = longseq[current_base];
        if(lngVectors[c_b/MXNUM][c_b%MXNUM] == 1) {
            //printf("hello");
            //fprintf(outfile,"longmer failed: c_b: %ld, lngVectors[c_b]: %d\n",
            //    c_b,lngVectors[c_b]);
            return(0);
        }
    }
    //record prohibitions on 4mer
    if(current_base > 2 && fourmers[fourseq[current_base]] > spacings[current_base])
        fourmers[fourseq[current_base]] = spacings[current_base];
    //all tests passed!! Add new base:
    //fprintf(outfile,"passed!");
    //passed all tests
    return(1);
}



int remove_base_data(int current_base)
{
    int i;
    int curt; //current fourmer

    if(sequence[current_base] < 0) return(1);
    //calculate complement
    curt = fourseq[current_base];
    //remove current 4mer prohibitions
    if(fourmers[curt] > -1) fourmers[curt] = 1000;
    //reinstate prohibitions for previous copies of this 4mer
    for(i=0; i<current_base; ++i) {
        if(fourseq[i] == curt) {
            if(fourmers[curt] > spacings[i]) fourmers[curt] = spacings[i];
        }
    }
    //adjust base count
    --basecnt[sequence[current_base]];
    return(1);
}

int complement(int fourmer)
{
    int i;
    int comp;

    //assuming fourmers in four bases
    for(i=0, comp=0; i<4; ++i) {
        comp *= 4;
        comp += 3-(fourmer % 4); //add complement of base
        fourmer /= 4;
    }
    return(comp);
}

double seq_to_int()
{
    int i;
    double seq_int = 0;

    //assumes 4 bases
    for(i=PRETAG; i<LENGTH; ++i) {
        seq_int *= 4;
        seq_int += sequence[i];
        //printf("\n%ld",seq_int);
    }
    return(seq_int);
}

int print_sequence()
{
    int i;

    for(i=PRETAG; i<LENGTH; ++i) fprintf(outfile,"%c",label[sequence[i]]);
    return(1);
}
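
The four-mer bookkeeping in the listing above rests on two small computations: a rolling base-4 code for the four-mer ending at each position, and the code of its reverse complement. The following stand-alone sketch isolates them so they can be tried on short strings. The helper names (base_code, roll_fourmer, fourmer_code, fourmer_complement) are hypothetical and introduced only for this illustration; the arithmetic is the same as in test_base() and complement(), with bases coded A=0, C=1, G=2, T=3 as in label[].

```c
/* Base codes follow label[] = "ACGT": A=0, C=1, G=2, T=3; -1 otherwise. */
static int base_code(char c)
{
    switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  return -1;
    }
}

/* Append one base to a rolling four-mer code: shift left by one base,
   drop the oldest base (mod 256), and add the new one. */
int roll_fourmer(int previous, int base)
{
    return (4 * previous) % 256 + base;
}

/* Code of the four-mer ending at the last character of s, or -1 if s is
   shorter than four bases or contains a character other than A, C, G, T. */
int fourmer_code(const char *s)
{
    int code = 0, n = 0;
    for (; *s; ++s, ++n) {
        int b = base_code(*s);
        if (b < 0) return -1;
        code = roll_fourmer(code, b);
    }
    return n >= 4 ? code : -1;
}

/* Reverse complement of a four-mer code: the last base is complemented
   first and ends up in the most significant position. */
int fourmer_complement(int fourmer)
{
    int i, comp = 0;
    for (i = 0; i < 4; ++i) {
        comp *= 4;
        comp += 3 - (fourmer % 4);   /* complement of lowest-order base */
        fourmer /= 4;
    }
    return comp;
}
```

For instance, ACGT is its own reverse complement, so its code maps to itself, and the four homopolymer runs AAAA, CCCC, GGGG and TTTT have codes 0, 85, 170 and 255, which are the entries the listing marks with -1.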



Example 4: Tags895.ccp

A preferred method of selecting probe arrays involves discarding all probes from a pool of probes which have
identical 9-mer runs (thereby eliminating many tags which will cross-hybridize), followed by pair-wise comparison of
the remaining tag nucleic acids and elimination of tags which hybridize to the same target. An exemplar program
(Tags895.ccp) written in "C" is provided below.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <alloc.h>

char out[] = "TAGS.OUT";
char hout[] = "HIST.OUT";
char label[] = "ACGT";

#define BASES 4 //number of nucleotides
//the following numbers include all the nonperiodic bases at the
//end of the primer (in this case one C) in addition to the tag
#define PRETAG 0 //bases of primer preceding tag
#define LENGTH 20 //length of tag plus pretag
#define GCLIM 11 //max total of G's and C's allowed
#define ATLIM 11 //max total of A's and T's allowed
#define AALIM 7 //max total of A's allowed
#define CCLIM 6 //max total of C's allowed
#define GGLIM 6 //max total of G's allowed
#define TTLIM 7 //max total of T's allowed
#define NUMERS 256 //number of fourmers
#define LMAT 10 //length of long matches prohibited
#define LNUM 1048576L //number of longmers
#define MXNUM ( 32768 / sizeof(int))

short numVectors;
int far ** lngVectors;

int test_base(int current_base);
int remove_base_data(int current_base);
int complement(int fourmer);
double seq_to_int(void);
int print_sequence(void);

int pattern[LENGTH][BASES]; //identifies allowed bases each position
int fourmers[NUMERS]; //identifies prohibited 4mers
int sequence[LENGTH]; //base at specified position
int fourseq[LENGTH]; //fourmers ending at specified position
long longseq[LENGTH]; //longmers ending at specified position
int basecnt[BASES]; //number of occurrences of each base to left of
                    //current_position
int spacings[LENGTH]; //position past (or =) to which an occurrence of
                      //a complementary fourmer to the one at this
                      //position is prohibited

FILE *outfile, *houtfile;

main()
{
    int i,j,k,cbs; //counters
    long temp,c_b; //temp storage of longmer
    long xpas; //number of passes thru loop over 4
    long maxtag; //max number of different tags
    int current_base; //current sequence position within backtracking
    int tag_count; //count of acceptable tag sequences
    int pass; //flag for whether tag was acceptable
    int exflag; //flag indicating completion of all possible tags
    double cur_seq; //integer representation of current tag sequence
    double prev_seq; //integer representation of previously tested tag
    int histogram[LENGTH]; //histogram of mismatches to cseq
    int mismatches;

    numVectors = (LNUM/MXNUM);
    lngVectors = (int **) farcalloc(numVectors, sizeof(int far *));
    if(lngVectors == NULL) {
        printf("\nerror allocating lngVectors!");
        exit(1);
    }
    for(i=0; i<numVectors; ++i) {
        lngVectors[i] = (int *) farcalloc(MXNUM, sizeof(int));
        if(lngVectors[i] == NULL) {
            printf("\nerror allocating vectors: i = %d\n",i);
            exit(1);
        }
    }

    //open output file
    outfile = fopen(out,"wt");
    if(NULL == outfile) {
        printf("\nerror opening output file %s\n",out);
        exit(1);
    }
    //open houtput file
    houtfile = fopen(hout,"wt");
    if(NULL == houtfile) {
        printf("\nerror opening output file %s\n",hout);
        fclose(outfile);exit(1);
    }
    //initialize histogram
    for(i=0; i<LENGTH; ++i) histogram[i] = 0;
    //initialize pattern array
    for(i=0; i<LENGTH; ++i) for(k=0; k<BASES; ++k) pattern[i][k] = 1;
    //initialize 4mer array
    for(i=0; i<NUMERS; ++i) fourmers[i] = 1000;
    //initialize 9mer array
    for(i=0; i<LMAT; ++i) lngVectors[i/MXNUM][i%MXNUM] = 0;
    //Runs marked
    fourmers[0] = fourmers[85] = fourmers[170] = fourmers[255] = -1;
    //initialize sequence
    for(i=0; i<LENGTH; ++i) sequence[i] = 0;
    //initialize spacings to disregard incomplete words at beginning
    for(i=0; i<3; ++i) spacings[i] = 1000;
    //initialize spacings to require 6mer palindrome
    for(i=3; i<LENGTH; ++i) spacings[i] = i+2;
    //initialize base sequence, current position, base count, tag count etc.
    //start at "largest" sequence we know is wrong
    sequence[0] = -1;
    current_base = 0;
    basecnt[0] = basecnt[1] = basecnt[2] = basecnt[3] = 0;
    tag_count = 0;
    prev_seq = -1;
    //initialize fourseq for fourmers up to current_base
    //don't need to since it is at zero and it will be reset
    //initialize fourmers for fourmers up to current_base
    //don't need to do anything because only one is at zero and
    // it will be removed shortly anyway
    //calculate maximum number of different tags
    for(maxtag = 1,i=0; i<LENGTH-PRETAG-1; ++i) {
        maxtag *= BASES;
    }
    printf("\nmaxtag = %ld\n",maxtag);
    //THIS IS A DEBUG STATEMENT!! REMOVE LATER
    maxtag = (long) 10000000000;
    pass = 0;

    //until all probes are exhausted
    for(j=0, xpas=0,exflag=0; exflag != 1 && xpas<maxtag; ++j) {
        /* if(j==4){
            ++xpas;
            j=0;
        }
        */
        //DEBUG
        //fprintf(outfile,"\nNOW: ");
        //print_sequence();

        //backtrack to last incrementable base
        for(; (sequence[current_base] == BASES-1) || (pass == 1 && current_base>7);
              --current_base) {
            //remove current base data
            remove_base_data(current_base);
            //set base to zero
            sequence[current_base] = 0;
        }
        if(current_base < PRETAG) {
            exflag = 1;
            continue;
        }

        //remove current base data
        remove_base_data(current_base);
        //increment current_base
        ++sequence[current_base];
        //error checking to ensure sequence is increasing
        //DEBUG
        //fprintf(outfile,"\nUPD: ");
        //print_sequence();
        /* cur_seq = seq_to_int();
        if(cur_seq <= prev_seq) {
            print_sequence();
            printf("\n\n!!ERRROR: current_base = %d, cur seq = %.0f, prev seq = %.0f, j = %d\n",
                current_base,cur_seq,prev_seq,j);
            fclose(outfile);exit(1);
        }
        prev_seq = cur_seq;
        */
        //update base data and test until reach end or failure
        for(pass = 1; current_base < LENGTH && pass == 1; ++current_base)
            pass = test_base(current_base);
        //if testing was successful, print sequence
        --current_base;

        //extra check to be sure not repeating longmers
        if(pass == 1) {
            //record prohibitions on 9mers
            for(cbs = LMAT-1; cbs < LENGTH; ++cbs) {
                c_b = longseq[cbs];
                if(lngVectors[c_b/MXNUM][c_b%MXNUM] != 0) {
                    printf("\n\n!!ERROR: already matching 9mer found: c_b = %ld, current_base = %d, longseq = %ld",
                        c_b, cbs, longseq[cbs]);
                    fclose(outfile);exit(1);
                }
            }
            //record longmer prohibitions
            for(cbs = LMAT-1; cbs < LENGTH; ++cbs) {
                c_b = longseq[cbs];
                lngVectors[c_b/MXNUM][c_b%MXNUM] = 1;
            }
            //increment tag count
            ++tag_count;
            //print tag sequence
            fprintf(outfile,"\n");
            print_sequence();
            /* //calculate mismatches with ref sequence, cseq
            mismatches = 0;
            for(i=0; i<LENGTH; ++i) if(cseq[i] != sequence[i]) ++mismatches;
            ++histogram[mismatches];
            if(mismatches <= 2) {
                fprintf(houtfile,"\n");
                for(i=PRETAG; i<LENGTH; ++i)
                    fprintf(houtfile,"%c",label[sequence[i]]);
            }
            */
            //fprintf(outfile," OK!");
        }
    }

    //if exited on j >= MAX record error and exit
    if(j>=maxtag)
        printf("\n!!ERROR: exceeded allowable passes through primary loop\n");
    //print tag count
    fprintf(outfile,"\n\nTag Count: %d\n",tag_count);
    for(i=0; i<LENGTH; ++i) fprintf(houtfile,"\n%d mismatches: %d\n",i,histogram[i]);
    fclose(outfile);
    fclose(houtfile);
    return(1);
}

int test_base(int current_base)
{
    int comp; //complement
    long c_b;

    //increment base count (must be accomplished before exiting this function)
    ++basecnt[sequence[current_base]];
    //fprintf(outfile,"\nbase = %d ",current_base);
    //test base count, return upon failure
    if(basecnt[0] + basecnt[3] > ATLIM) {
        //fprintf(outfile,"failed AT limit");
        return(0);
    }
    if(basecnt[1] + basecnt[2] > GCLIM) {
        //fprintf(outfile,"failed GC limit");
        return(0);
    }
    if(basecnt[0] > AALIM) {
        //fprintf(outfile,"failed A limit");
        return(0);
    }
    if(basecnt[1] > CCLIM) {
        //fprintf(outfile,"failed C limit");
        return(0);
    }
    if(basecnt[2] > GGLIM) {
        //fprintf(outfile,"failed G limit");
        return(0);
    }
    if(basecnt[3] > TTLIM) {
        //fprintf(outfile,"failed T limit");
        return(0);
    }
    //test if base matches patterns, return upon failure
    if(pattern[current_base][sequence[current_base]] != 1) {
        //fprintf(outfile,"failed pattern match");
        return(0);
    }
    //if at last base verify checksum
    if(current_base == LENGTH-1)
        if(0 == ((basecnt[0]+basecnt[2])%2)) {
            //fprintf(outfile,"failed checksum");
            return(0);
        }
    //compute current 4mer
    if(current_base == 0) fourseq[0] = sequence[current_base];
    else fourseq[current_base] = (4*fourseq[current_base-1])%256 + sequence[current_base];
    //if this is a full 4mer check 4mer and return upon failure
    if(current_base > 2) {
        comp = complement(fourseq[current_base]);
        if(fourmers[comp] <= current_base) {
            //fprintf(outfile,"fourmer failed: curt: %d, comp: %d, spacing: %d\n",
            //    fourseq[current_base],comp,fourmers[comp]);
            return(0);
        }
    }
    //compute current 9mer if full 9mer
    if(current_base == 0) longseq[0] = sequence[current_base];
    else longseq[current_base] = (4*longseq[current_base-1])%LNUM + sequence[current_base];
    //if full longmer check 9mer and return upon failure
    if(current_base > LMAT-2) {
        c_b = longseq[current_base];
        if(lngVectors[c_b/MXNUM][c_b%MXNUM] == 1) {
            //printf("hello");
            //fprintf(outfile,"longmer failed: c_b: %ld, lngVectors[c_b]: %d\n",
            //    c_b,lngVectors[c_b]);
            return(0);
        }
    }
    //record prohibitions on 4mer
    if(current_base > 2 && fourmers[fourseq[current_base]] > spacings[current_base])
        fourmers[fourseq[current_base]] = spacings[current_base];
    //all tests passed!! Add new base:
    //fprintf(outfile,"passed!");
    //passed all tests
    return(1);
}

int remove_base_data(int current_base)
{
    int i;
    int curt; //current fourmer

    if(sequence[current_base] < 0) return(1);
    //calculate complement
    curt = fourseq[current_base];
    //remove current 4mer prohibitions
    if(fourmers[curt] > -1) fourmers[curt] = 1000;
    //reinstate prohibitions for previous copies of this 4mer
    for(i=0; i<current_base; ++i) {
        if(fourseq[i] == curt) {
            if(fourmers[curt] > spacings[i]) fourmers[curt] = spacings[i];
        }
    }
    //adjust base count
    --basecnt[sequence[current_base]];
    return(1);
}

int complement(int fourmer)
{
    int i;
    int comp;

    //assuming fourmers in four bases
    for(i=0, comp=0; i<4; ++i) {
        comp *= 4;
        comp += 3-(fourmer % 4); //add complement of base
        fourmer /= 4;
    }
    return(comp);
}

double seq_to_int()
{
    int i;
    double seq_int = 0;

    //assumes 4 bases
    for(i=PRETAG; i<LENGTH; ++i) {
        seq_int *= 4;
        seq_int += sequence[i];
        //printf("\n%ld",seq_int);
    }
    return(seq_int);
}

int print_sequence()
{
    int i;

    for(i=PRETAG; i<LENGTH; ++i) fprintf(outfile,"%c",label[sequence[i]]);
    return(1);
}

A truncated output list is provided below: 

/** sample output file for tags895.cpp **/

AAACAAACACCCGCGTGGTT 

AAACAAAGACCCGCCGGTGT 

AAACAAATACCCGCCGTGGG 

AAACAACAACCCGCGTGTGG 

AAACAACCAACCCGGTGTGG 

AAACAACGAACCCGCTGGTG 

AAACAACTAACCCGCTGTGG 

AAACAAGAACCCGCCGTTGG 

AAACAAGCAACCCGGCGTGT 

AAACAAGGAACCCGCCTCGT 

AAACAAGTAACCCGCCTTTG 

AAACAATAACCCGCGCTGGG 

AAACAATCAACCCGCTTGGG 

AAACAATGAACCCGCGTCGG 

AAACAATTAACCCGCGTTCG 

AAACACAAACCCGGCTGGTG

AAACACACAACCCGTGGTGG 

AAACACAGAACCCGCTTTGG 

AAACACATAACCCGGCGGTG 

AAACACCAAACCCGTTGTGG 

AAACACCCAAACCGTTGTGG 

AAACACCGAAACCCTGTGGG 

AAACACCTAAACCCTTGTGG 

AAACACGAAACCCGGTCGGT 

AAACACGCAAACCCGGTGGT 

AAACACGGAAACCCTCGGTG

AAACACOTAAAOCCGTCGOT 

AAACACTAAACCCGTGOGGT 

AAACACTCAAACCCTGGTGG 

AAACACTGAAACCCGTCTGG 

AAACACTTAAACCCGTTCGG 
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AAACAGAAACCCGCTCGGTG 
AAACAGACAACCCGGCTTGG 
AAACAGAGAACCCGGCCTTG 
AAACAGATAACCCGCTCTTG 
AAACAGCAAACCCGTGGCGT 
AAACAGCCAAACCGCGTGGT 
//many lines removed here // 

Tag Count: 14507 
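
The exclusion of tags that share long runs of contiguous nucleotides can be spot-checked on pairs of tags from a list such as the one above with the brute-force sketch below. It is an illustration only: the function name share_run and the direct string comparison are assumptions made here for clarity, whereas the listing records each long-mer code in a lookup table as tags are accepted rather than comparing tags pairwise.

```c
#include <string.h>

/* Returns 1 if tags a and b contain an identical run of k contiguous
   nucleotides anywhere along their lengths, and 0 otherwise. */
int share_run(const char *a, const char *b, int k)
{
    size_t la = strlen(a), lb = strlen(b), n, i, j;

    if (k <= 0) return 0;
    n = (size_t)k;
    if (la < n || lb < n) return 0;

    for (i = 0; i + n <= la; ++i)
        for (j = 0; j + n <= lb; ++j)
            if (strncmp(a + i, b + j, n) == 0) return 1;
    return 0;
}
```

Applied to the first two tags of the truncated list, share_run("AAACAAACACCCGCGTGGTT", "AAACAAAGACCCGCCGGTGT", 9) finds no common 9-mer, although the two tags do share the 7-mer AAACAAA at their start.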



All publications and patent applications cited in this specification are herein incorporated by reference for all
purposes as if each individual publication or patent application were specifically and individually indicated to be
incorporated by reference.

Although the foregoing invention has been described in some detail by way of illustration and example for purposes
of clarity of understanding, it will be readily apparent to those of ordinary skill in the art in light of the teachings of this
invention that certain changes and modifications may be made thereto without departing from the spirit or scope of the
appended claims.



Claims 



1. A method of selecting a set of tag nucleic acids with minimal cross hybridization to a nucleic acid, said method
comprising providing a list of tag nucleic acids, and excluding nucleic acids from the list of tag nucleic acids which
cross hybridize to a single complementary nucleic acid under stringent conditions, thereby providing a set of
selected tag nucleic acids with minimal cross hybridization to the nucleic acid.

2. A method of claim 1, wherein the method of selecting tag nucleic acids further comprises:

selecting a first tag nucleic acid from the list of tag nucleic acids; 

selecting a second tag nucleic acid from the list of tag nucleic acids; optionally tags not being selected if they 
have more than 8 contiguous nucleotides in common with any previous tag; 

comparing the sequence of the first tag nucleic acid to the sequence of the second tag nucleic acid and 
determining that the second tag nucleic acid hybridizes to the complement of the first tag nucleic acid with a 
selected thermal binding stability, thereby excluding the second nucleic acid from the selected set of nucleic 
acid tags; 

optionally the method comprising rejecting or selecting each tag in the list of tags in order or rejecting or selecting
tags in complementary pairs, wherein each selected tag has a complementary selected tag.

3. A method of claim 1 or claim 2, wherein the method further comprises selecting a thermal binding stability for the
tags, and excluding all tag nucleic acids from the list of tag nucleic acids which do not have the selected thermal
binding stability, preferably thermal binding stability being selected by specifying a ratio of G+C to A+T nucleotides
for the tag nucleic acids, and specifying a length for the tag nucleic acids.

4. A method of any one of claims 1 to 3, wherein the method further comprises excluding tags which contain
self-complementary regions from the list of tags, preferably regions of self-complementarity greater than 4 nucleotides
in length.

5. A method of any one of claims 1 to 4, wherein the tags are between 15 and 30 nucleotides in length or between
10 and 100 nucleotides in length, preferably 20 nucleotides in length.

6. A method of any one of claims 1 to 5, wherein the method further comprises selecting a complementary probe
nucleic acid for tags in the selected tag set, wherein each tag sequence is complementary to one probe sequence
and the thermal binding stability between each tag and each complementary probe is substantially similar,
preferably all of the tags having the same length and the same GC to AT ratio.

7. A method of any one of claims 1 to 6, wherein the method further comprises selecting a constant region
subsequence shared by all tag nucleic acids, thereby determining the nucleotide position of variable nucleotides in the
tags; optionally the method further comprising providing a set of probe nucleic acids by determining the complement
to each variable nucleotide in each tag nucleic acid and selecting a probe comprising a corresponding
complementary nucleotide for each nucleotide in the variable tag sequence, which probe does not hybridize to the constant
region of the tag nucleic acid, thereby providing a selected set of probes.

8. A method of any one of claims 1 to 7, wherein the method further comprises removing tag nucleic acids which
have fewer than five, and preferably fewer than two, nucleotide differences when aligned for maximal sequence
correspondence; preferably

the total number of nucleotides in each of the selected sets being identical;
the number of G+C nucleotides in each tag in the selected set being identical; and,
the overall number of A+G nucleotides in each of the variable regions of the tags being even.

9. A method of any one of claims 1 to 8, wherein tags which contain 4 contiguous nucleotides selected from the group
consisting of 4 X residues, 4 Y residues and 4 Z residues, are eliminated from the tag set, wherein X is selected
from the group consisting of G and C, Y is selected from the group consisting of G and A, and Z is selected from
the group consisting of A and T.

10. A composition comprising a set of tag nucleic acids, which set of tag nucleic acids comprises a plurality of tag
nucleic acids, which tag nucleic acids comprise a variable region, optionally comprising less than two C nucleotides;

which variable region for each tag nucleic acid in the set of tag nucleic acids has the same Tm, the same G+C
to A+T ratio, the same length and does not cross-hybridize to a probe nucleic acid, preferably the variable
region of the tag nucleic acids from the set of tag nucleic acids comprising an even number of A+G nucleotides
and,

wherein the tag nucleic acids in the set of tag nucleic acids cannot be aligned with less than two differences
between any two of the tag nucleic acids in the set of tag nucleic acids;

and preferably wherein the tags also comprise a constant region.

11. A method of labeling a composition, comprising associating a tag nucleic acid with the composition, wherein the 
* lu^anSy a s C imilar T ^ * 9r ° UP * ^ nUC,eiC ^ * n0 * ^^"^ and which have a 

12. A method of claim 11, further comprising detection of the tag nucleic acid, e.g. by labeling the nucleic acid and
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to hybridize to the group of tag nucleic acids 

13. A method of claim 11 or claim 12, further comprising amplification of the tag nucleic acids, thereby providing
amplified tag nucleic acids, and detection of the amplified tag nucleic acids by hybridization to an array of probes

ZnZZ* ,0 ,he teg nuc,eic eg wherein ,he ,ag nuc,eic acWs ' ■» 

14. A method of claim 13. wherein the amplified tag nucleic acids are labeled with a fluorescent label. 

15. A method of preselecting experimental probes in an oligonucleotide probe array, wherein the probes have
substantially uniform hybridization properties and do not cross hybridize, comprising:

selecting a ratio of G+C to A+T nucleotides shared by the experimental probes in the array;

determining all possible 4 nucleotide subsequences for variable nucleic acids in the probes of the array; and

exc.ud.ng all probes from the array ** teh ^ 

se li^Zl^TT"?* n T ucleo,ide sub8oquence8 are se,ec ^™ * *?iE55 

Z^SSSS^ * P 4 Pr ° beS ' Pr0b6S ' "* Pf0beS ""*™"»* * —I* 
optionally wherein the method further comprises
selecting a length for the probes in the array, thereby providing selected length probes;

selecting a constant region subsequence shared by all selected length probes in the array, thereby determining
the nucleotide position of variable nucleic acids in the probes of the array; and

providing that the overall number of A+G nucleotides in the probes of the array is even.

16. A method of claim 15, wherein the method further comprises selecting control probes for addition to the array. 

17. A method of detecting a plurality of nucleic acids in a sample, comprising 

(i) providing an array of experimental oligonucleotide probes, which probes do not cross hybridize under
stringent conditions, wherein the ratio of G+C bases in each probe is substantially identical;

wherein the probes of the array are arranged into probe sets in which each probe set comprises a ho- 
mogeneous population of oligonucleotide probes; preferably the nucleic acids comprising tag sequences 
which tag sequences bind to the probes of the array; 

(ii) hybridizing said array of oligonucleotides to the sample under stringent hybridization conditions, optionally
the probes of the array specifically hybridizing to at least one nucleic acid in the sample; and

(iii) detecting hybridization of the nucleic acids to the array of oligonucleotide probes.

18. An array of oligonucleotide probes comprising a plurality of experimental oligonucleotide probe sets attached to a 
solid substrate, wherein 

each experimental oligonucleotide probe set in the array hybridizes to a different target nucleic acid under 
stringent hybridization conditions; 

each oligonucleotide probe in the probe sets of the array comprises a variable region; and wherein 
the nucleic acid probes do not cross-hybridize in the array. 

19. An array of claim 18 further defined by any one or more of the following features (a) to (f):-

(a) each probe set in the array having a constant region, wherein the variable region does not cross hybridize 
with the constant region under stringent hybridization conditions; 

(b) each probe set in the array differing from every other probe set in the array by the arrangement of at least 
two nucleotides in the probes of the probe set; 

(c) the ratio of G+C bases in each probe for each experimental probe set being substantially identical; 

(d) the array comprising a plurality of probe sets selected from the output group of probes produced by running

(e) the array further comprising a nucleic acid bound to a probe in the array; and 

(f) the array further comprising control probes. 

20. [...] nucleic acid comprising providing a population of nucleic [...] to an array of
oligonucleotide probes and monitoring hybridization of the test nucleic acids to the probes in the array, wherein

[...] probe [...] hybridizes to [...] nucleic acid under [...]

each oligonucleotide probe in the probe sets of the array comprises a variable region; optionally the probes of
the array comprising a constant region, wherein the variable region does not cross hybridize with the constant
region under stringent hybridization conditions; and wherein
the nucleic acid probes do not cross-hybridize in the array;

[...] a nucleic acid complementary to the control probe to the array.
21. A plurality of recombinant cells comprising tag nucleic acids selected from a set of tag nucleic acids, which set of
tag nucleic acids comprises a plurality of tag nucleic acids, which tag nucleic acids comprise a variable region;

which variable region for each tag nucleic acid in the set of tag nucleic acids has the same Tm, the same G+C
to A+T ratio, the same length and does not cross-hybridize; optionally wherein the tags further comprise a 
constant region, wherein the variable region does not cross hybridize with the constant region under stringent 
hybridization conditions; and 

wherein the tag nucleic acids in the set of tag nucleic acids cannot be aligned with less than two differences 
between any two of the tag nucleic acids in the set of tag nucleic acids. 
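
The requirement in claim 21 that no two tags can be aligned with fewer than two differences can be illustrated with a short sketch. It makes two simplifying assumptions that are not part of the claim: differences are counted position by position between equal-length tags (a Hamming distance, ignoring shifted or gapped alignments), and Tm is estimated with the Wallace rule, 2 °C per A or T plus 4 °C per G or C, which is only a rough estimate for short oligonucleotides. The function names are hypothetical.

```python
def hamming(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("tags must have the same length")
    return sum(x != y for x, y in zip(a, b))


def wallace_tm(seq):
    """Rough Tm estimate in degrees C using the Wallace rule (assumption)."""
    at = sum(base in "AT" for base in seq)
    gc = sum(base in "GC" for base in seq)
    return 2 * at + 4 * gc


def select_tags(candidates, min_diff=2):
    """Greedily keep candidates that differ from every tag already kept
    by at least min_diff positions."""
    chosen = []
    for tag in candidates:
        if all(hamming(tag, kept) >= min_diff for kept in chosen):
            chosen.append(tag)
    return chosen


if __name__ == "__main__":
    tags = select_tags(["ACGT", "ACGA", "TCGA", "AGGT"])
    print(tags)
    print([wallace_tm(t) for t in tags])
```

Under the Wallace rule, tags of the same length and the same G+C count have the same estimated Tm, so filtering candidates on length and G+C count first makes the Tm condition follow automatically in this simplified model. The greedy pass depends on candidate order and does not guarantee the largest possible tag set.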

22. A recombinant cell of claim 21 which is:-

(a) selected from a library of genetically distinct recombinant cells;

(b) a eukaryotic cell; 

(c) a prokaryotic cell; or 

(d) a yeast cell. 

23. A kit comprising an array of oligonucleotides, wherein 

the array of oligonucleotide probes comprises a plurality of experimental oligonucleotide probe sets attached 
to a solid substrate; 

each experimental oligonucleotide probe set in the array hybridizes to a different target nucleic acid under 
stringent hybridization conditions; 

each oligonucleotide probe in the probe sets of the array comprises a variable region, optionally each oligo- 
nucleotide in the array further comprising a constant region, wherein the variable region does not cross hy- 
bridize with the constant region under stringent hybridization conditions; and 
the nucleic acid probes do not cross-hybridize in the array. 

24. A kit of claim 23, wherein the kit further comprises:-

(a) a plurality of tag nucleic acids complementary to the experimental oligonucleotide probes in the array; and/or

(b) control oligonucleotide probes; and/or 

(c) PCR reagents, a container and instructions.

25. The use of nucleic acid sequences as tags for components of a library such as to permit component identification 
by tag hybridization to a probe array. 
20mer Array hybridized with control oligonucleotide

[...] ADE1 [...] SITE; [...] flanking [...]

Figure 3

Tag Analysis: population of 11 different tags

Figure 5